Benign Overfitting in Token Selection of Attention Mechanism

Keitaro Sakamoto, Issei Sato
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52644-52727, 2025.

Abstract

The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise. We show that, with a characterization of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves “benign overfitting”, i.e., maintaining high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets.
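To make the setting concrete, below is a minimal synthetic sketch of the phenomenon the abstract describes. It is not the paper's model, data distribution, or code: the dimensions, hyperparameters, and names (make_data, snr, rho, etc.) are illustrative assumptions. One token per sequence carries a class signal, the rest are Gaussian noise, a fraction of training labels is flipped, and only a single softmax-attention parameter p is trained by gradient descent while the value head is fixed to the signal direction.

import numpy as np

rng = np.random.default_rng(0)
d, T, n_train, n_test = 400, 8, 50, 500   # high dimension relative to n
rho, snr = 0.1, 4.0                        # label-flip rate, signal strength
mu = np.zeros(d); mu[0] = snr              # class-signal direction

def make_data(n):
    X = rng.normal(size=(n, T, d))         # noise tokens
    y = rng.choice([-1.0, 1.0], size=n)    # clean labels
    X[:, 0, :] = y[:, None] * mu           # token 0 carries +/- mu
    return X, y

X_tr, y_clean = make_data(n_train)
flip = rng.random(n_train) < rho           # inject label noise
y_tr = np.where(flip, -y_clean, y_clean)
X_te, y_te = make_data(n_test)

nu = mu / np.linalg.norm(mu)               # fixed value-head direction
p = np.zeros(d)                            # trainable attention parameter

def attn(X, p):
    s = X @ p                              # per-token attention scores
    a = np.exp(s - s.max(axis=1, keepdims=True))
    return a / a.sum(axis=1, keepdims=True)   # softmax over tokens

def logits(X, p):
    return np.einsum('ntd,nt,d->n', X, attn(X, p), nu)

lr = 0.5
for _ in range(3000):                      # gradient descent on p only
    a = attn(X_tr, p)
    v = X_tr @ nu                          # per-token value scores
    s = (a * v).sum(axis=1)                # model output (logit)
    xbar = np.einsum('nt,ntd->nd', a, X_tr)   # attention-weighted token mean
    # d(logit)/dp = sum_t a_t v_t (x_t - xbar); chain rule with logistic loss:
    g = -y_tr / (1.0 + np.exp(np.clip(y_tr * s, -30, 30)))
    dp = np.einsum('nt,nt,ntd->nd', a, v, X_tr) - s[:, None] * xbar
    p -= lr * (g[:, None] * dp).mean(axis=0)

print("train acc (noisy labels):", np.mean(np.sign(logits(X_tr, p)) == y_tr))
print("test acc (clean labels): ", np.mean(np.sign(logits(X_te, p)) == y_te))

In an overparameterized regime (d large relative to n), the attention can put weight on noise tokens to fit the flipped training labels while still selecting the signal token on fresh inputs, which is the benign overfitting the abstract describes; whether and when this occurs in the sketch depends on the SNR, dimension, and sample size, mirroring the paper's conditions only qualitatively.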

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-sakamoto25a,
  title     = {Benign Overfitting in Token Selection of Attention Mechanism},
  author    = {Sakamoto, Keitaro and Sato, Issei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {52644--52727},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/sakamoto25a/sakamoto25a.pdf},
  url       = {https://proceedings.mlr.press/v267/sakamoto25a.html},
  abstract  = {Attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism, under classification problems with label noise. We show that, with the characterization of signal-to-noise ratio (SNR), the token selection of attention mechanism achieves “benign overfitting”, i.e., maintaining high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets.}
}
Endnote
%0 Conference Paper
%T Benign Overfitting in Token Selection of Attention Mechanism
%A Keitaro Sakamoto
%A Issei Sato
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-sakamoto25a
%I PMLR
%P 52644--52727
%U https://proceedings.mlr.press/v267/sakamoto25a.html
%V 267
%X Attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism, under classification problems with label noise. We show that, with the characterization of signal-to-noise ratio (SNR), the token selection of attention mechanism achieves “benign overfitting”, i.e., maintaining high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets.
APA
Sakamoto, K. & Sato, I. (2025). Benign Overfitting in Token Selection of Attention Mechanism. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52644-52727. Available from https://proceedings.mlr.press/v267/sakamoto25a.html.