In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

Jianliang He, Xintian Pan, Siyu Chen, Zhuoran Yang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22686-22742, 2025.

Abstract

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query weights, and a last-entry-only and zero-sum pattern in the output-value weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor — one that outperforms single-head attention and nearly achieves Bayesian optimality up to proportional factor. We also extend our study to scenarios with anisotropic covariates and multi-task linear regression. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-he25q, title = {In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention}, author = {He, Jianliang and Pan, Xintian and Chen, Siyu and Yang, Zhuoran}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {22686--22742}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/he25q/he25q.pdf}, url = {https://proceedings.mlr.press/v267/he25q.html}, abstract = {We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query weights, and a last-entry-only and zero-sum pattern in the output-value weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor — one that outperforms single-head attention and nearly achieves Bayesian optimality up to proportional factor. We also extend our study to scenarios with anisotropic covariates and multi-task linear regression. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.} }
Endnote
%0 Conference Paper %T In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention %A Jianliang He %A Xintian Pan %A Siyu Chen %A Zhuoran Yang %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-he25q %I PMLR %P 22686--22742 %U https://proceedings.mlr.press/v267/he25q.html %V 267 %X We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query weights, and a last-entry-only and zero-sum pattern in the output-value weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor — one that outperforms single-head attention and nearly achieves Bayesian optimality up to proportional factor. We also extend our study to scenarios with anisotropic covariates and multi-task linear regression. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.
APA
He, J., Pan, X., Chen, S. & Yang, Z.. (2025). In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22686-22742 Available from https://proceedings.mlr.press/v267/he25q.html.

Related Material