Go Wide, Then Narrow: Efficient Training of Deep Thin Networks

Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan Song, Quoc Le, Qiang Liu, Dale Schuurmans
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:11546-11555, 2020.

Abstract

To be deployed in production, a deep learning model needs to be both accurate and compact to meet latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretical guarantee. Our method is motivated by model compression. It consists of three stages. First, we sufficiently widen the deep thin network and train it until convergence. Then, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by layerwise imitation, that is, forcing the thin network to mimic the intermediate outputs of the wide network from layer to layer. Finally, we further fine-tune this already well-initialized deep thin network. The theoretical guarantee is established using neural mean field analysis. It demonstrates the advantage of our layerwise imitation approach over backpropagation. We also conduct large-scale empirical experiments to validate the proposed method. When trained with our method, ResNet50 can outperform ResNet101, and BERT base can be comparable to BERT large, when ResNet101 and BERT large are trained under the standard training procedures in the literature.
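The layerwise imitation stage described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the fixed random "wide" weights stand in for a trained wide network, and the bottleneck layer shape `tanh(U (V h))`, the choice to feed each thin layer the wide network's previous-layer output, the squared-error loss, and every name and hyperparameter below are assumptions made only to keep the example small and runnable. The final fine-tuning stage is omitted.

```python
import math
import random

random.seed(0)
WIDE, THIN, DEPTH = 4, 2, 2  # hypothetical sizes, chosen only for illustration


def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def act(v):
    return [math.tanh(a) for a in v]


# Stage 1 stand-in (assumption): fixed random weights play the role of a
# wide network that has already been trained to convergence.
wide_layers = [rand_matrix(WIDE, WIDE, scale=1.0) for _ in range(DEPTH)]


def wide_activations(x):
    """Return the input followed by the output of every wide layer."""
    hs = [x]
    for W in wide_layers:
        hs.append(act(matvec(W, hs[-1])))
    return hs


# Thin network (assumption): each layer passes through a THIN-unit bottleneck,
# y = tanh(U (V h)), so its output has the same size as the wide layer's output.
thin_layers = [(rand_matrix(WIDE, THIN), rand_matrix(THIN, WIDE)) for _ in range(DEPTH)]


def thin_layer(U, V, h):
    z = matvec(V, h)
    return z, act(matvec(U, z))


def imitation_loss(data):
    """Mean squared mismatch between thin and wide layer outputs, summed over layers."""
    total = 0.0
    for x in data:
        hs = wide_activations(x)
        for l, (U, V) in enumerate(thin_layers):
            _, y = thin_layer(U, V, hs[l])
            total += 0.5 * sum((a - b) ** 2 for a, b in zip(y, hs[l + 1]))
    return total / len(data)


def imitate(data, epochs=200, lr=0.05):
    """Stage 2 sketch: fit each thin layer to its wide counterpart, layer by layer."""
    for _ in range(epochs):
        for x in data:
            hs = wide_activations(x)
            for l, (U, V) in enumerate(thin_layers):
                h, t = hs[l], hs[l + 1]
                z, y = thin_layer(U, V, h)
                # Gradient of the squared error through tanh.
                d = [(yi - ti) * (1.0 - yi * yi) for yi, ti in zip(y, t)]
                gz = [sum(U[i][j] * d[i] for i in range(WIDE)) for j in range(THIN)]
                for i in range(WIDE):
                    for j in range(THIN):
                        U[i][j] -= lr * d[i] * z[j]
                for j in range(THIN):
                    for k in range(WIDE):
                        V[j][k] -= lr * gz[j] * h[k]


data = [[random.uniform(-1, 1) for _ in range(WIDE)] for _ in range(32)]
loss_before = imitation_loss(data)
imitate(data)
loss_after = imitation_loss(data)
print(f"imitation loss: {loss_before:.4f} -> {loss_after:.4f}")
```

Each thin layer here is trained against a local target, so no gradient has to flow through the whole depth of the network. In the full method, the thin network initialized this way would then be fine-tuned on the actual task.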

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-zhou20f,
  title = {Go Wide, Then Narrow: Efficient Training of Deep Thin Networks},
  author = {Zhou, Denny and Ye, Mao and Chen, Chen and Meng, Tianjian and Tan, Mingxing and Song, Xiaodan and Le, Quoc and Liu, Qiang and Schuurmans, Dale},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {11546--11555},
  year = {2020},
  editor = {Hal Daumé III and Aarti Singh},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/zhou20f/zhou20f.pdf},
  url = {http://proceedings.mlr.press/v119/zhou20f.html},
  abstract = {For deploying a deep learning model into production, it needs to be both accurate and compact to meet the latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretic guarantee. Our method is motivated by model compression. It consists of three stages. First, we sufficiently widen the deep thin network and train it until convergence. Then, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by layerwise imitation, that is, forcing the thin network to mimic the intermediate outputs of the wide network from layer to layer. Finally, we further fine tune this already well-initialized deep thin network. The theoretical guarantee is established by using the neural mean field analysis. It demonstrates the advantage of our layerwise imitation approach over backpropagation. We also conduct large-scale empirical experiments to validate the proposed method. By training with our method, ResNet50 can outperform ResNet101, and BERT base can be comparable with BERT large, when ResNet101 and BERT large are trained under the standard training procedures as in the literature.}
}
Endnote
%0 Conference Paper
%T Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
%A Denny Zhou
%A Mao Ye
%A Chen Chen
%A Tianjian Meng
%A Mingxing Tan
%A Xiaodan Song
%A Quoc Le
%A Qiang Liu
%A Dale Schuurmans
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-zhou20f
%I PMLR
%P 11546--11555
%U http://proceedings.mlr.press/v119/zhou20f.html
%V 119
%X For deploying a deep learning model into production, it needs to be both accurate and compact to meet the latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretic guarantee. Our method is motivated by model compression. It consists of three stages. First, we sufficiently widen the deep thin network and train it until convergence. Then, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by layerwise imitation, that is, forcing the thin network to mimic the intermediate outputs of the wide network from layer to layer. Finally, we further fine tune this already well-initialized deep thin network. The theoretical guarantee is established by using the neural mean field analysis. It demonstrates the advantage of our layerwise imitation approach over backpropagation. We also conduct large-scale empirical experiments to validate the proposed method. By training with our method, ResNet50 can outperform ResNet101, and BERT base can be comparable with BERT large, when ResNet101 and BERT large are trained under the standard training procedures as in the literature.
APA
Zhou, D., Ye, M., Chen, C., Meng, T., Tan, M., Song, X., Le, Q., Liu, Q., & Schuurmans, D. (2020). Go Wide, Then Narrow: Efficient Training of Deep Thin Networks. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:11546-11555. Available from http://proceedings.mlr.press/v119/zhou20f.html
