Brainformers: Trading Simplicity for Efficiency

Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M. Dai, Yifeng Lu, Zhifeng Chen, Quoc V Le, Claire Cui, James Laudon, Jeff Dean
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:42531-42542, 2023.

Abstract

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.
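
The abstract describes the Brainformer block as a non-uniform stack of layer primitives rather than the usual attention/feed-forward alternation. The sketch below is only an illustration of that idea in plain NumPy: the specific sub-layer ordering, single-head attention, ReLU activation, top-2 expert routing, pre-layer-norm residuals, and all sizes are assumptions made for readability, not the searched configuration reported in the paper.

    # Minimal NumPy sketch of a Brainformer-style block: a non-uniform stack of
    # sub-layers (self-attention, dense feed-forward, sparsely gated mixture-of-
    # experts feed-forward), each wrapped in a pre-layer-norm residual connection.
    import numpy as np

    def layer_norm(x, eps=1e-6):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def self_attention(x, wq, wk, wv, wo):
        # Single-head scaled dot-product attention (multi-head omitted for brevity).
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        return (weights @ v) @ wo

    def dense_ffn(x, w1, w2):
        return np.maximum(x @ w1, 0.0) @ w2  # ReLU activation (an assumption)

    def moe_ffn(x, experts, w_gate, k=2):
        # Sparsely gated feed-forward: route each token to its top-k expert FFNs.
        logits = x @ w_gate                        # [num_tokens, num_experts]
        topk = np.argsort(-logits, axis=-1)[:, :k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            gate = np.exp(logits[t, topk[t]])
            gate /= gate.sum()
            for g, e in zip(gate, topk[t]):
                w1, w2 = experts[e]
                out[t] += g * dense_ffn(x[t:t+1], w1, w2)[0]
        return out

    def brainformer_block(x, params):
        # One possible permutation of layer primitives; the paper searches over
        # such orderings instead of fixing the attention/FFN alternation.
        x = x + self_attention(layer_norm(x), *params["attn"])
        x = x + dense_ffn(layer_norm(x), *params["ffn"])
        x = x + moe_ffn(layer_norm(x), params["experts"], params["w_gate"])
        return x

    # Toy usage with random weights (hypothetical sizes).
    d_model, d_hidden, num_experts, num_tokens = 64, 128, 4, 8
    rng = np.random.default_rng(0)
    params = {
        "attn": [0.02 * rng.standard_normal((d_model, d_model)) for _ in range(4)],
        "ffn": [0.02 * rng.standard_normal((d_model, d_hidden)),
                0.02 * rng.standard_normal((d_hidden, d_model))],
        "experts": [(0.02 * rng.standard_normal((d_model, d_hidden)),
                     0.02 * rng.standard_normal((d_hidden, d_model)))
                    for _ in range(num_experts)],
        "w_gate": 0.02 * rng.standard_normal((d_model, num_experts)),
    }
    y = brainformer_block(rng.standard_normal((num_tokens, d_model)), params)
    print(y.shape)  # (8, 64)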

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-zhou23c,
  title     = {Brainformers: Trading Simplicity for Efficiency},
  author    = {Zhou, Yanqi and Du, Nan and Huang, Yanping and Peng, Daiyi and Lan, Chang and Huang, Da and Shakeri, Siamak and So, David and Dai, Andrew M. and Lu, Yifeng and Chen, Zhifeng and Le, Quoc V and Cui, Claire and Laudon, James and Dean, Jeff},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {42531--42542},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/zhou23c/zhou23c.pdf},
  url       = {https://proceedings.mlr.press/v202/zhou23c.html},
  abstract  = {Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.}
}
Endnote
%0 Conference Paper
%T Brainformers: Trading Simplicity for Efficiency
%A Yanqi Zhou
%A Nan Du
%A Yanping Huang
%A Daiyi Peng
%A Chang Lan
%A Da Huang
%A Siamak Shakeri
%A David So
%A Andrew M. Dai
%A Yifeng Lu
%A Zhifeng Chen
%A Quoc V Le
%A Claire Cui
%A James Laudon
%A Jeff Dean
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-zhou23c
%I PMLR
%P 42531--42542
%U https://proceedings.mlr.press/v202/zhou23c.html
%V 202
%X Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.
APA
Zhou, Y., Du, N., Huang, Y., Peng, D., Lan, C., Huang, D., Shakeri, S., So, D., Dai, A.M., Lu, Y., Chen, Z., Le, Q.V., Cui, C., Laudon, J. & Dean, J. (2023). Brainformers: Trading Simplicity for Efficiency. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:42531-42542. Available from https://proceedings.mlr.press/v202/zhou23c.html.
