Superiority of Multi-Head Attention: A Theoretical Study in Shallow Transformers in In-Context Linear Regression

Yingqian Cui, Jie Ren, Pengfei He, Hui Liu, Jiliang Tang, Yue Xing
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:937-945, 2025.

Abstract

We present a theoretical analysis of the performance of transformers with softmax attention in in-context learning on linear regression tasks. While the existing theoretical literature predominantly focuses on providing convergence upper bounds to show that trained transformers with single-/multi-head attention can achieve good in-context learning performance, our research centers on a more rigorous comparison of the exact convergence of single- and multi-head attention. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. As the number of in-context examples $D$ increases, the prediction loss is $O(1/D)$ for both single- and multi-head attention, with multi-head attention attaining a smaller multiplicative constant. Beyond the simplest data distribution setting, our technical framework for calculating the exact convergence further facilitates the study of additional scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is preferred over single-head attention. Our results verify the effectiveness of the multi-head attention design in the transformer architecture.
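Schematically, the convergence comparison stated above can be summarized as follows; the loss symbol $\mathcal{L}$ and the constants $c_{\text{single}}$, $c_{\text{multi}}$ are illustrative notation rather than the paper's own, and the exact constants depend on the data distribution settings analyzed in the paper:

$$\mathcal{L}_{\text{single}}(D) = \frac{c_{\text{single}}}{D} + o\!\left(\frac{1}{D}\right), \qquad \mathcal{L}_{\text{multi}}(D) = \frac{c_{\text{multi}}}{D} + o\!\left(\frac{1}{D}\right), \qquad c_{\text{multi}} < c_{\text{single}}.$$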

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-cui25a,
  title = {Superiority of Multi-Head Attention: A Theoretical Study in Shallow Transformers in In-Context Linear Regression},
  author = {Cui, Yingqian and Ren, Jie and He, Pengfei and Liu, Hui and Tang, Jiliang and Xing, Yue},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {937--945},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/cui25a/cui25a.pdf},
  url = {https://proceedings.mlr.press/v258/cui25a.html},
  abstract = {We present a theoretical analysis of the performance of transformer with softmax attention in in-context learning with linear regression tasks. While the existing theoretical literature predominantly focuses on providing convergence upper bounds to show that trained transformers with single-/multi-head attention can obtain a good in-context learning performance, our research centers on comparing the exact convergence of single- and multi-head attention more rigorously. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. When the number of in-context examples $D$ increases, the prediction loss using single-/multi-head attention is in $O(1/D)$, and the one for multi-head attention has a smaller multiplicative constant. In addition to the simplest data distribution setting, our technical framework in calculating the exact convergence further facilitates studying more scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is preferred over single-head attention. Our results verify the effectiveness of the design of multi-head attention in the transformer architecture.}
}
Endnote
%0 Conference Paper
%T Superiority of Multi-Head Attention: A Theoretical Study in Shallow Transformers in In-Context Linear Regression
%A Yingqian Cui
%A Jie Ren
%A Pengfei He
%A Hui Liu
%A Jiliang Tang
%A Yue Xing
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-cui25a
%I PMLR
%P 937--945
%U https://proceedings.mlr.press/v258/cui25a.html
%V 258
%X We present a theoretical analysis of the performance of transformer with softmax attention in in-context learning with linear regression tasks. While the existing theoretical literature predominantly focuses on providing convergence upper bounds to show that trained transformers with single-/multi-head attention can obtain a good in-context learning performance, our research centers on comparing the exact convergence of single- and multi-head attention more rigorously. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. When the number of in-context examples $D$ increases, the prediction loss using single-/multi-head attention is in $O(1/D)$, and the one for multi-head attention has a smaller multiplicative constant. In addition to the simplest data distribution setting, our technical framework in calculating the exact convergence further facilitates studying more scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is preferred over single-head attention. Our results verify the effectiveness of the design of multi-head attention in the transformer architecture.
APA
Cui, Y., Ren, J., He, P., Liu, H., Tang, J. & Xing, Y. (2025). Superiority of Multi-Head Attention: A Theoretical Study in Shallow Transformers in In-Context Linear Regression. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:937-945. Available from https://proceedings.mlr.press/v258/cui25a.html.
