Attention is not all you need: pure attention loses rank doubly exponentially with depth

Yihe Dong; Jean-Baptiste Cordonnier; Andreas Loukas

Attention is not all you need: pure attention loses rank doubly exponentially with depth

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:2793-2803, 2021.

Abstract

Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-dong21a,
  title = 	 {Attention is not all you need: pure attention loses rank doubly exponentially with depth},
  author =       {Dong, Yihe and Cordonnier, Jean-Baptiste and Loukas, Andreas},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {2793--2803},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/dong21a/dong21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/dong21a.html},
  abstract = 	 {Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.}
}

Endnote

%0 Conference Paper
%T Attention is not all you need: pure attention loses rank doubly exponentially with depth
%A Yihe Dong
%A Jean-Baptiste Cordonnier
%A Andreas Loukas
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang	
%F pmlr-v139-dong21a
%I PMLR
%P 2793--2803
%U https://proceedings.mlr.press/v139/dong21a.html
%V 139
%X Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.

APA


Dong, Y., Cordonnier, J. & Loukas, A.. (2021). Attention is not all you need: pure attention loses rank doubly exponentially with depth. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:2793-2803 Available from https://proceedings.mlr.press/v139/dong21a.html.

Attention is not all you need: pure attention loses rank doubly exponentially with depth

Abstract

Cite this Paper

Related Material