KV Shifting Attention Enhances Language Modeling

Mingyu Xu, Bingning Wang, Weipeng Chen
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:69725-69748, 2025.

Abstract

Current large language models (LLMs) predominantly rely on decoder-only transformer architectures, which exhibit exceptional in-context learning (ICL) capabilities. It is widely acknowledged that the cornerstone of their ICL ability lies in the induction heads mechanism, which necessitates at least two layers of attention. To more effectively harness the model’s induction capabilities, we revisit the induction heads mechanism and provide theoretical proof that KV shifting attention reduces the model’s dependency on the depth and width of the induction heads mechanism. Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pre-trained models with over 10 billion parameters.
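To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of what a KV shifting attention layer might look like, assuming the "shift" mixes each position's key and value with those of the immediately preceding position through learnable per-head weights. The parameter names (alpha, beta) and the exact mixing rule are illustrative assumptions drawn from the abstract's description, not the paper's definitive formulation.

# Hypothetical sketch of a "KV shifting" self-attention layer (assumed
# parameterization): each head's keys and values are blended with those of the
# previous position via learnable scalar weights before standard causal attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVShiftingAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Per-head mixing weights for (current, previous) token -- assumed form.
        self.alpha = nn.Parameter(torch.full((n_heads, 2), 0.5))  # for keys
        self.beta = nn.Parameter(torch.full((n_heads, 2), 0.5))   # for values

    def _shift(self, x: torch.Tensor) -> torch.Tensor:
        # Shift along the sequence dimension by one step, zero-padding position 0.
        return F.pad(x, (0, 0, 1, 0))[:, :, :-1, :]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, head_dim).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        # Mix each key/value with its predecessor's key/value.
        k = self.alpha[None, :, None, 0:1] * k + self.alpha[None, :, None, 1:2] * self._shift(k)
        v = self.beta[None, :, None, 0:1] * v + self.beta[None, :, None, 1:2] * self._shift(v)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, -1))

The intuition suggested by the abstract is that giving a single attention layer direct access to the previous token's key and value lets it express induction-head-style lookups without relying on a second layer to perform the token shifting, which is why the construction would reduce the depth (and width) required by the induction heads mechanism.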

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-xu25ac,
  title     = {{KV} Shifting Attention Enhances Language Modeling},
  author    = {Xu, Mingyu and Wang, Bingning and Chen, Weipeng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {69725--69748},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/xu25ac/xu25ac.pdf},
  url       = {https://proceedings.mlr.press/v267/xu25ac.html},
  abstract  = {Current large language models (LLMs) predominantly rely on decoder-only transformer architectures, which exhibit exceptional in-context learning (ICL) capabilities. It is widely acknowledged that the cornerstone of their ICL ability lies in the induction heads mechanism, which necessitates at least two layers of attention. To more effectively harness the model’s induction capabilities, we revisit the induction heads mechanism and provide theoretical proof that KV shifting attention reduces the model’s dependency on the depth and width of the induction heads mechanism. Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pre-trained models with over 10 billion parameters.}
}
Endnote
%0 Conference Paper
%T KV Shifting Attention Enhances Language Modeling
%A Mingyu Xu
%A Bingning Wang
%A Weipeng Chen
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-xu25ac
%I PMLR
%P 69725--69748
%U https://proceedings.mlr.press/v267/xu25ac.html
%V 267
%X Current large language models (LLMs) predominantly rely on decoder-only transformer architectures, which exhibit exceptional in-context learning (ICL) capabilities. It is widely acknowledged that the cornerstone of their ICL ability lies in the induction heads mechanism, which necessitates at least two layers of attention. To more effectively harness the model’s induction capabilities, we revisit the induction heads mechanism and provide theoretical proof that KV shifting attention reduces the model’s dependency on the depth and width of the induction heads mechanism. Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pre-trained models with over 10 billion parameters.
APA
Xu, M., Wang, B., & Chen, W. (2025). KV Shifting Attention Enhances Language Modeling. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:69725-69748. Available from https://proceedings.mlr.press/v267/xu25ac.html.