On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures

Wei Shen, Ruida Zhou, Jing Yang, Cong Shen
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54732-54771, 2025.

Abstract

Although transformers have demonstrated impressive capabilities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that allows transformers to perform ICL is still in its infancy. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground truth distribution of the labels. Experimental results corroborate the theoretical findings.
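To make the setting described in the abstract concrete, the following is a minimal, illustrative numpy sketch of in-context classification of a two-component Gaussian mixture: a prompt of labeled in-context examples plus a query, answered by a single-layer attention-style read-out that aggregates the context labels. The identity attention matrix W, the sigmoid read-out, and all dimensions are assumptions for illustration only, not the paper's exact architecture or training procedure.

    # Illustrative sketch (assumptions, not the paper's construction):
    # binary in-context classification of a Gaussian mixture with a
    # hand-set single-layer attention-style predictor.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 8        # feature dimension (assumed)
    N = 200      # prompt length, i.e. number of in-context examples (assumed)

    # One task: two Gaussian components with opposite means, labels in {-1, +1}.
    mu = rng.normal(size=d)
    labels = rng.choice([-1.0, 1.0], size=N)
    X = labels[:, None] * mu + rng.normal(size=(N, d))   # x_i ~ N(y_i * mu, I)

    # Query example drawn from the same mixture.
    y_query = rng.choice([-1.0, 1.0])
    x_query = y_query * mu + rng.normal(size=d)

    # Single-layer attention-style prediction: average the context labels,
    # weighted by a query-key inner product (identity weights as a stand-in
    # for the learned attention parameters).
    W = np.eye(d)
    scores = X @ W @ x_query / N     # linear attention over the prompt
    logit = scores @ labels          # aggregate the in-context labels
    prob_pos = 1.0 / (1.0 + np.exp(-logit))   # read out P(y = +1)

    print(f"true label {y_query:+.0f}, predicted P(y=+1) = {prob_pos:.3f}")

As the prompt length N grows, the weighted label average in this sketch concentrates, which mirrors (informally) the abstract's claim that with sufficiently long training and testing prompts the trained transformer's prediction approaches the ground-truth label distribution.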

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-shen25q,
  title     = {On the Training Convergence of Transformers for In-Context Classification of {G}aussian Mixtures},
  author    = {Shen, Wei and Zhou, Ruida and Yang, Jing and Shen, Cong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {54732--54771},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shen25q/shen25q.pdf},
  url       = {https://proceedings.mlr.press/v267/shen25q.html},
  abstract  = {Although transformers have demonstrated impressive capabilities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that allows transformers to perform ICL is still in its infancy. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground truth distribution of the labels. Experimental results corroborate the theoretical findings.}
}
Endnote
%0 Conference Paper
%T On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures
%A Wei Shen
%A Ruida Zhou
%A Jing Yang
%A Cong Shen
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shen25q
%I PMLR
%P 54732--54771
%U https://proceedings.mlr.press/v267/shen25q.html
%V 267
%X Although transformers have demonstrated impressive capabilities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that allows transformers to perform ICL is still in its infancy. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground truth distribution of the labels. Experimental results corroborate the theoretical findings.
APA
Shen, W., Zhou, R., Yang, J. & Shen, C. (2025). On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54732-54771. Available from https://proceedings.mlr.press/v267/shen25q.html.
