Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers

Thiziri Nait Saada, Alireza Naderi, Jared Tanner
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:45561-45587, 2025.

Abstract

Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications—a common pattern across various architectures—we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
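As a rough illustration of the spectral-gap claim, the short NumPy sketch below (not taken from the paper) builds a softmax attention matrix at random initialisation, compares its two largest singular values as the context length T grows, and then removes the dominant rank-one component by subtracting the uniform matrix (1/T)·11^T. The head dimension, the weight scaling, and this particular removal step are illustrative assumptions, not necessarily the paper's exact procedure.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_k = 64                                  # head dimension (illustrative choice)
for T in (64, 256, 1024):                 # increasing context lengths
    X = rng.standard_normal((T, d_k))     # random token representations at initialisation
    W_Q = rng.standard_normal((d_k, d_k)) / np.sqrt(d_k)
    W_K = rng.standard_normal((d_k, d_k)) / np.sqrt(d_k)
    # Row-stochastic softmax attention matrix with the usual 1/sqrt(d_k) temperature.
    A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k))

    s = np.linalg.svd(A, compute_uv=False)
    print(f"T={T:5d}  sigma_1={s[0]:.3f}  sigma_2={s[1]:.3f}  ratio={s[0]/s[1]:.1f}")

    # One simple way to remove the rank-one outlier: subtract the uniform matrix (1/T) 11^T,
    # which cancels the all-ones direction that carries the leading singular value.
    A_removed = A - np.ones((T, T)) / T
    s_r = np.linalg.svd(A_removed, compute_uv=False)
    print(f"          after removal: sigma_1={s_r[0]:.3f}  sigma_2={s_r[1]:.3f}")

In this toy setting the leading singular value stays near 1 (every row of A sums to 1), while the rest of the spectrum shrinks as T grows, so the ratio between the two largest singular values widens with context length; after subtracting the uniform component, no comparable outlier remains.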

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-nait-saada25a,
  title     = {Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers},
  author    = {Nait Saada, Thiziri and Naderi, Alireza and Tanner, Jared},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {45561--45587},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nait-saada25a/nait-saada25a.pdf},
  url       = {https://proceedings.mlr.press/v267/nait-saada25a.html},
  abstract  = {Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications—a common pattern across various architectures—we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.}
}
Endnote
%0 Conference Paper
%T Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers
%A Thiziri Nait Saada
%A Alireza Naderi
%A Jared Tanner
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-nait-saada25a
%I PMLR
%P 45561--45587
%U https://proceedings.mlr.press/v267/nait-saada25a.html
%V 267
%X Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications—a common pattern across various architectures—we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
APA
Nait Saada, T., Naderi, A., & Tanner, J. (2025). Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:45561-45587. Available from https://proceedings.mlr.press/v267/nait-saada25a.html.