Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:28063-28096, 2025.

Abstract

Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show for the first time that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such pattern appears in values (V), across various modern transformer-based LLMs. Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that the concentration is caused by Rotary Positional Encoding (RoPE) and appears from the very first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://github.com/MingyuJ666/Rope_with_LLM.
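To make the central observation concrete, the sketch below shows one simple way to quantify per-dimension magnitude concentration in the query (Q), key (K), and value (V) states of an attention layer. It is an illustrative sketch, not the authors' measurement code: the tensors q_states, k_states, and v_states are placeholders for activations captured from a real model (for example, via forward hooks around an attention module, taking Q and K after RoPE is applied), and the concentration metric, the ratio of the largest per-dimension mean magnitude to the median, is an assumed proxy rather than the exact criterion used in the paper.

# Minimal sketch (assumptions noted above): quantify how strongly magnitude
# concentrates in a few head dimensions of Q, K, and V activations.
import torch

def concentration_ratio(states: torch.Tensor) -> float:
    """states: [num_heads, seq_len, head_dim] activations from one attention layer.

    Returns max(per-dim mean |value|) / median(per-dim mean |value|);
    a large ratio suggests that a few dimensions carry 'massive values'.
    """
    per_dim = states.abs().mean(dim=(0, 1))  # [head_dim]
    return (per_dim.max() / per_dim.median()).item()

# Placeholder tensors standing in for activations captured from a real model
# (e.g., post-RoPE query/key states and value states of one layer).
num_heads, seq_len, head_dim = 32, 128, 128
q_states = torch.randn(num_heads, seq_len, head_dim)
k_states = torch.randn(num_heads, seq_len, head_dim)
v_states = torch.randn(num_heads, seq_len, head_dim)

for name, t in [("Q", q_states), ("K", k_states), ("V", v_states)]:
    print(f"{name} concentration ratio: {concentration_ratio(t):.2f}")

On random placeholder data all three ratios stay near 1; per the paper's findings, Q and K states captured from a real RoPE-based LLM would show much larger ratios than V.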

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-jin25f,
  title     = {Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
  author    = {Jin, Mingyu and Mei, Kai and Xu, Wujiang and Sun, Mingjie and Tang, Ruixiang and Du, Mengnan and Liu, Zirui and Zhang, Yongfeng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {28063--28096},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/jin25f/jin25f.pdf},
  url       = {https://proceedings.mlr.press/v267/jin25f.html},
  abstract  = {Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show for the first time that these concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K) while not having such patterns in values (V) in various modern transformer-based LLMs. Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE) and it appears since very first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://github.com/MingyuJ666/Rope_with_LLM.}
}
Endnote
%0 Conference Paper
%T Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
%A Mingyu Jin
%A Kai Mei
%A Wujiang Xu
%A Mingjie Sun
%A Ruixiang Tang
%A Mengnan Du
%A Zirui Liu
%A Yongfeng Zhang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-jin25f
%I PMLR
%P 28063--28096
%U https://proceedings.mlr.press/v267/jin25f.html
%V 267
%X Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show for the first time that these concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K) while not having such patterns in values (V) in various modern transformer-based LLMs. Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE) and it appears since very first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://github.com/MingyuJ666/Rope_with_LLM.
APA
Jin, M., Mei, K., Xu, W., Sun, M., Tang, R., Du, M., Liu, Z. & Zhang, Y. (2025). Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:28063-28096. Available from https://proceedings.mlr.press/v267/jin25f.html.
