Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:45561-45587, 2025.
Abstract
Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications—a common pattern across various architectures—we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
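As a rough illustration of the phenomenon described above, the following sketch (our own illustration, not code from the paper) builds random softmax attention matrices at initialisation, reports the gap between the two largest singular values as the context length n grows, and then applies a simple rank-one correction A - (1/n)11^T that removes the component along the all-ones direction of a row-stochastic matrix. The function name softmax_attention and this particular correction are assumptions for the sake of the example; the authors' exact procedure for removing the outlier eigenvalue(s) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_attention(n, d):
    """Random softmax attention matrix at initialisation:
    row i of A is softmax(q_i K^T / sqrt(d)), so A is row-stochastic."""
    Q = rng.standard_normal((n, d))
    K = rng.standard_normal((n, d))
    logits = (Q @ K.T) / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

for n in [64, 256, 1024]:                         # growing context length
    A = softmax_attention(n, d=64)
    s = np.linalg.svd(A, compute_uv=False)
    print(f"n={n:5d}  top sv={s[0]:.3f}  second sv={s[1]:.3f}  "
          f"gap ratio={s[0] / s[1]:.1f}")

    # Illustrative rank-one correction: subtract the all-ones component,
    # which carries the outlier singular value of a row-stochastic matrix.
    A_corrected = A - np.ones((n, n)) / n
    s_c = np.linalg.svd(A_corrected, compute_uv=False)
    print(f"          after removing outlier: top sv={s_c[0]:.3f}  "
          f"second sv={s_c[1]:.3f}")
```

In this toy setup the top singular value stays near 1 while the remaining singular values shrink as n grows, so the printed gap ratio widens with context length; after the rank-one correction the leading singular values are of comparable size, which is the qualitative effect the proposed fix aims for.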