Consensus Is All You Get: The Role of Attention in Transformers

Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:174-184, 2025.

Abstract

A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
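The consensus phenomenon described in the abstract can be illustrated with a minimal sketch: iterate a bare softmax self-attention step (identity query, key, and value maps, no residual connections or layer normalization, which is a simplification relative to the paper's full transformer model) and watch the spread of the tokens shrink. All names below (`attention_step`, `spread`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                          # 8 tokens in R^4
X = rng.normal(size=(n, d))

def attention_step(X):
    # Bare self-attention with identity Q, K, V: each token is replaced
    # by a softmax-weighted convex combination of all tokens.
    scores = X @ X.T / np.sqrt(X.shape[1])
    W = np.exp(scores - scores.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)  # row-stochastic attention matrix
    return W @ X

for _ in range(300):                 # iterate "layers"
    X = attention_step(X)

# Maximum distance of any token from the common mean after many layers.
spread = np.linalg.norm(X - X.mean(axis=0), axis=1).max()
print(f"spread after 300 steps: {spread:.2e}")
```

Because each step applies a row-stochastic matrix with strictly positive entries, the convex hull of the tokens contracts at every layer, so the spread decays toward zero: the tokens reach consensus, consistent with the asymptotic behavior the paper analyzes.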

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-abella25a,
  title     = {Consensus Is All You Get: The Role of Attention in Transformers},
  author    = {Abella, \'{A}lvaro Rodr\'{\i}guez and Silvestre, Jo\~{a}o Pedro and Tabuada, Paulo},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {174--184},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/abella25a/abella25a.pdf},
  url       = {https://proceedings.mlr.press/v267/abella25a.html},
  abstract  = {A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion, all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.}
}
Endnote
%0 Conference Paper
%T Consensus Is All You Get: The Role of Attention in Transformers
%A Álvaro Rodríguez Abella
%A João Pedro Silvestre
%A Paulo Tabuada
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-abella25a
%I PMLR
%P 174--184
%U https://proceedings.mlr.press/v267/abella25a.html
%V 267
%X A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion, all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
APA
Abella, Á.R., Silvestre, J.P. & Tabuada, P. (2025). Consensus Is All You Get: The Role of Attention in Transformers. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:174-184. Available from https://proceedings.mlr.press/v267/abella25a.html.