Transformer Quality in Linear Time

Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc Le
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9099-9117, 2022.

Abstract

We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9x on Wiki-40B and 12.1x on PG-19 for auto-regressive language modeling, and 4.8x on C4 for masked language modeling.
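To make the abstract's high-level description a bit more concrete, below is a minimal NumPy sketch of a gated single-head attention layer in the spirit of the gated attention unit. The specific choices here (SiLU activations, a shared low-dimensional basis z turned into queries and keys by per-dimension scale/offset, squared-ReLU attention weights, and the sizes d, e, s) are illustrative assumptions based on the abstract, not the paper's exact formulation, and the linear (chunked) approximation used by FLASH is omitted entirely.

import numpy as np

def silu(x):
    """SiLU (swish) activation."""
    return x / (1.0 + np.exp(-x))

def init_gau(d, e, s, rng):
    """Random parameters for one gated attention unit; all sizes are illustrative."""
    return {
        "W_u": rng.normal(0.0, d ** -0.5, (d, e)),  # gating branch
        "W_v": rng.normal(0.0, d ** -0.5, (d, e)),  # value branch
        "W_z": rng.normal(0.0, d ** -0.5, (d, s)),  # shared query/key basis
        "W_o": rng.normal(0.0, e ** -0.5, (e, d)),  # output projection
        "gamma_q": np.ones(s), "beta_q": np.zeros(s),
        "gamma_k": np.ones(s), "beta_k": np.zeros(s),
    }

def gau(x, p):
    """Apply one gated attention unit to x of shape (seq_len, d_model)."""
    n = x.shape[0]
    u = silu(x @ p["W_u"])                        # (n, e) gate
    v = silu(x @ p["W_v"])                        # (n, e) values
    z = silu(x @ p["W_z"])                        # (n, s) shared basis
    # Queries and keys are cheap per-dimension rescalings of the same z.
    q = z * p["gamma_q"] + p["beta_q"]
    k = z * p["gamma_k"] + p["beta_k"]
    # Single-head quadratic attention; a squared ReLU stands in for softmax here.
    attn = np.maximum(q @ k.T / n, 0.0) ** 2      # (n, n)
    # Gating: elementwise product of the gate with the attended values.
    return (u * (attn @ v)) @ p["W_o"]            # back to (n, d_model)

# Tiny usage example with made-up sizes.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))                     # seq_len=16, d_model=64
y = gau(x, init_gau(d=64, e=128, s=32, rng=rng))
print(y.shape)                                    # (16, 64)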

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-hua22a,
  title     = {Transformer Quality in Linear Time},
  author    = {Hua, Weizhe and Dai, Zihang and Liu, Hanxiao and Le, Quoc},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {9099--9117},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/hua22a/hua22a.pdf},
  url       = {https://proceedings.mlr.press/v162/hua22a.html}
}
Endnote
%0 Conference Paper
%T Transformer Quality in Linear Time
%A Weizhe Hua
%A Zihang Dai
%A Hanxiao Liu
%A Quoc Le
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-hua22a
%I PMLR
%P 9099--9117
%U https://proceedings.mlr.press/v162/hua22a.html
%V 162
APA
Hua, W., Dai, Z., Liu, H., & Le, Q. (2022). Transformer Quality in Linear Time. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:9099-9117. Available from https://proceedings.mlr.press/v162/hua22a.html.