Curse of High Dimensionality Issue in Transformer for Long Context Modeling

Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76600-76624, 2025.

Abstract

Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.
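The abstract describes DGA as aggregating less important tokens before the attention computation so that only a few relevant tokens keep full resolution. The snippet below is a minimal, illustrative NumPy sketch of that token-aggregation idea only; the importance score (raw query–key similarity), the top-k selection rule, the fixed group size, and the mean pooling are assumptions made for illustration and are not the authors' actual DGA algorithm.

```python
# Illustrative sketch of attention over a reduced token set: keep the
# most relevant tokens, mean-pool the rest into group summaries.
# The scoring rule, k_keep, group_size, and pooling are assumptions,
# not the DGA method from the paper.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(q, K, V, k_keep=8, group_size=4):
    """Single-query attention where less important tokens are aggregated.

    q: (d,) query; K, V: (n, d) keys and values for the context.
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)                        # crude importance proxy
    keep = np.argsort(scores)[-k_keep:]                # indices kept at full resolution
    rest = np.setdiff1d(np.arange(K.shape[0]), keep)   # tokens to aggregate

    # Pool the remaining tokens group-by-group into single summary tokens.
    groups = [rest[i:i + group_size] for i in range(0, len(rest), group_size)]
    K_agg = np.array([K[g].mean(axis=0) for g in groups]) if groups else np.empty((0, d))
    V_agg = np.array([V[g].mean(axis=0) for g in groups]) if groups else np.empty((0, d))

    K_red = np.concatenate([K[keep], K_agg], axis=0)   # reduced key set
    V_red = np.concatenate([V[keep], V_agg], axis=0)   # reduced value set

    w = softmax(K_red @ q / np.sqrt(d))                # attention over reduced set
    return w @ V_red

# Toy usage: 64 context tokens, 16-dimensional head.
rng = np.random.default_rng(0)
K = rng.standard_normal((64, 16))
V = rng.standard_normal((64, 16))
q = rng.standard_normal(16)
print(grouped_attention(q, K, V).shape)  # (16,)
```

Under this sketch, the final attention is computed over roughly k_keep + (n - k_keep) / group_size entries instead of n, which is where the computational saving described in the abstract would come from.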

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25cr,
  title     = {Curse of High Dimensionality Issue in Transformer for Long Context Modeling},
  author    = {Zhang, Shuhai and You, Zeng and Chen, Yaofo and Wen, Zhiquan and Wang, Qianyue and Qiu, Zhijie and Li, Yuanqing and Tan, Mingkui},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76600--76624},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25cr/zhang25cr.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25cr.html},
  abstract  = {Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.}
}
Endnote
%0 Conference Paper
%T Curse of High Dimensionality Issue in Transformer for Long Context Modeling
%A Shuhai Zhang
%A Zeng You
%A Yaofo Chen
%A Zhiquan Wen
%A Qianyue Wang
%A Zhijie Qiu
%A Yuanqing Li
%A Mingkui Tan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25cr
%I PMLR
%P 76600--76624
%U https://proceedings.mlr.press/v267/zhang25cr.html
%V 267
%X Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.
APA
Zhang, S., You, Z., Chen, Y., Wen, Z., Wang, Q., Qiu, Z., Li, Y., & Tan, M. (2025). Curse of High Dimensionality Issue in Transformer for Long Context Modeling. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76600-76624. Available from https://proceedings.mlr.press/v267/zhang25cr.html.
