Synthesizer: Rethinking Self-Attention for Transformer Models

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10183-10192, 2021.

Abstract

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that the simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.
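The following is a minimal NumPy sketch, purely illustrative and not the authors' reference implementation, of the two Synthesizer variants named in the abstract: a Dense Synthesizer, where each token predicts its own row of attention weights from its representation alone, and a Random Synthesizer, where the alignment matrix is a parameter (learned or even fixed) that does not depend on the input at all. All dimensions, parameter names, and the single-head, unscaled formulation are assumptions made for brevity.

# Minimal sketch of Dense and Random Synthesizer attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, Wv):
    # Each token predicts its own attention row from its representation alone;
    # no query-key dot products. X: (seq_len, d_model); the two-layer MLP maps
    # d_model -> d_hidden -> seq_len.
    B = softmax(np.maximum(X @ W1 + b1, 0) @ W2 + b2)  # (seq_len, seq_len) synthetic weights
    V = X @ Wv                                          # values, as in standard attention
    return B @ V

def random_synthesizer(X, R, Wv):
    # The alignment matrix R is shared across all inputs and independent of X.
    B = softmax(R)          # (seq_len, seq_len)
    V = X @ Wv
    return B @ V

# Tiny usage example with made-up dimensions.
L, d, h = 8, 16, 32
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
out_dense = dense_synthesizer(X, rng.normal(size=(d, h)), np.zeros(h),
                              rng.normal(size=(h, L)), np.zeros(L),
                              rng.normal(size=(d, d)))
out_rand = random_synthesizer(X, rng.normal(size=(L, L)), rng.normal(size=(d, d)))
print(out_dense.shape, out_rand.shape)  # (8, 16) (8, 16)

Note how neither variant computes a query-key dot product: the attention matrix is either synthesized per token from its own representation or shared globally across inputs.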

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-tay21a,
  title     = {Synthesizer: Rethinking Self-Attention for Transformer Models},
  author    = {Tay, Yi and Bahri, Dara and Metzler, Donald and Juan, Da-Cheng and Zhao, Zhe and Zheng, Che},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {10183--10192},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/tay21a/tay21a.pdf},
  url       = {https://proceedings.mlr.press/v139/tay21a.html},
  abstract  = {The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose \textsc{Synthesizer}, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that simple Random Synthesizer is not only $60\%$ faster but also improves perplexity by a relative $3.5\%$. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding only tasks.}
}
Endnote
%0 Conference Paper
%T Synthesizer: Rethinking Self-Attention for Transformer Models
%A Yi Tay
%A Dara Bahri
%A Donald Metzler
%A Da-Cheng Juan
%A Zhe Zhao
%A Che Zheng
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-tay21a
%I PMLR
%P 10183--10192
%U https://proceedings.mlr.press/v139/tay21a.html
%V 139
%X The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose \textsc{Synthesizer}, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that simple Random Synthesizer is not only $60\%$ faster but also improves perplexity by a relative $3.5\%$. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding only tasks.
APA
Tay, Y., Bahri, D., Metzler, D., Juan, D., Zhao, Z. & Zheng, C. (2021). Synthesizer: Rethinking Self-Attention for Transformer Models. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:10183-10192. Available from https://proceedings.mlr.press/v139/tay21a.html.