Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Zhuohan Li; Eric Wallace; Sheng Shen; Kevin Lin; Kurt Keutzer; Dan Klein; Joey Gonzalez

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:5958-5968, 2020.

Abstract

Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-li20m,
  title = 	 {Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers},
  author =       {Li, Zhuohan and Wallace, Eric and Shen, Sheng and Lin, Kevin and Keutzer, Kurt and Klein, Dan and Gonzalez, Joey},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {5958--5968},
  year = 	 {2020},
  editor = 	 {Daumé III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/li20m/li20m.pdf},
  url = 	 {https://proceedings.mlr.press/v119/li20m.html},
  abstract = 	 {Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.}
}

Endnote

%0 Conference Paper
%T Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
%A Zhuohan Li
%A Eric Wallace
%A Sheng Shen
%A Kevin Lin
%A Kurt Keutzer
%A Dan Klein
%A Joey Gonzalez
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-li20m
%I PMLR
%P 5958--5968
%U https://proceedings.mlr.press/v119/li20m.html
%V 119
%X Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

APA

Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D., & Gonzalez, J. (2020). Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:5958-5968. Available from https://proceedings.mlr.press/v119/li20m.html.
