Attention is not all you need: pure attention loses rank doubly exponentially with depth

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:2793-2803, 2021.

Abstract

Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.
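The rank-collapse claim lends itself to a quick numerical sanity check. The sketch below is not from the paper; it stacks single-head self-attention layers with random weights (no skips, no MLPs, no layer norm) and tracks how far the hidden state is from its best rank-1 approximation, once for pure attention and once with a skip connection added. The paper quantifies the residual with a composite L1/L-infinity norm; for simplicity this sketch uses an SVD-based relative residual instead, and all dimensions, depths, and weight scales are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_layer(X, d_k, rng):
        # Single-head self-attention with freshly sampled random weights;
        # deliberately omits the MLP, skip connection, and normalization.
        d = X.shape[1]
        Wq = rng.normal(scale=d ** -0.5, size=(d, d_k))
        Wk = rng.normal(scale=d ** -0.5, size=(d, d_k))
        Wv = rng.normal(scale=d ** -0.5, size=(d, d))
        A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_k), axis=-1)
        return A @ X @ Wv

    def rank1_residual(X):
        # Relative Frobenius distance of X to its best rank-1 approximation.
        s = np.linalg.svd(X, compute_uv=False)
        return np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())

    n, d, depth = 32, 64, 12
    X_pure = rng.normal(size=(n, d))
    X_skip = X_pure.copy()

    for layer in range(1, depth + 1):
        X_pure = attention_layer(X_pure, d_k=16, rng=rng)                # attention only
        X_skip = X_skip + attention_layer(X_skip, d_k=16, rng=rng)       # attention + skip
        print(f"layer {layer:2d}   pure residual {rank1_residual(X_pure):.2e}   "
              f"skip residual {rank1_residual(X_skip):.2e}")

If the paper's analysis carries over to this toy setting, the pure-attention residual should shrink rapidly with depth (all token representations collapse toward a common vector), while the skip-connected variant should retain a substantial residual; the sizes above are small enough that the script runs in well under a second.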

Cite this Paper

BibTeX
@InProceedings{pmlr-v139-dong21a,
  title     = {Attention is not all you need: pure attention loses rank doubly exponentially with depth},
  author    = {Dong, Yihe and Cordonnier, Jean-Baptiste and Loukas, Andreas},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {2793--2803},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/dong21a/dong21a.pdf},
  url       = {https://proceedings.mlr.press/v139/dong21a.html},
  abstract  = {Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.}
}
Endnote
%0 Conference Paper
%T Attention is not all you need: pure attention loses rank doubly exponentially with depth
%A Yihe Dong
%A Jean-Baptiste Cordonnier
%A Andreas Loukas
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-dong21a
%I PMLR
%P 2793--2803
%U https://proceedings.mlr.press/v139/dong21a.html
%V 139
%X Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.
APA
Dong, Y., Cordonnier, J.-B. & Loukas, A. (2021). Attention is not all you need: pure attention loses rank doubly exponentially with depth. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:2793-2803. Available from https://proceedings.mlr.press/v139/dong21a.html.
