MVA: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition

Wang Ning, Zekun Li, Tongxin Bai, Man Yao, Zhen Qin, Guoqi Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:46473-46497, 2025.

Abstract

Linear attention offers linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive, and such models exhibit significant performance gaps relative to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods through a high-order QK integration theory and a multi-level vocabulary decomposition. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting the compression loss of compressed states. Through detailed error analysis, we demonstrate that our approach achieves a superior approximation of Softmax attention. To further improve performance and reduce training costs, we adopt a soft integration strategy over attention scores that effectively incorporates a sliding window mechanism. With fewer than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99% of their original performance. Compared to state-of-the-art linear attention models and methods, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.
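To make the two ingredients the abstract alludes to more concrete, the sketch below illustrates, in generic NumPy rather than the authors' MVA implementation, a kernelized linear-attention recurrence with a fixed-size state and its soft combination with exact softmax attention over a short sliding window. The feature map, window size, and gate value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: fixed-state linear attention softly combined with a
# sliding-window softmax branch. Not the paper's MVA method; names such as
# feature_map, window, and gate are hypothetical choices for illustration.
import numpy as np

def feature_map(x):
    # Positive feature map (ELU + 1), a common choice for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def hybrid_attention(q, k, v, window=4, gate=0.5):
    """q, k, v: (seq_len, d). Returns outputs of shape (seq_len, d).

    Each step keeps an O(d^2) state for the linear branch and at most
    `window` past tokens for the exact softmax branch, so memory stays
    fixed with respect to sequence length.
    """
    seq_len, d = q.shape
    S = np.zeros((d, d))      # running sum of phi(k) v^T
    z = np.zeros(d)           # running sum of phi(k) for normalization
    outputs = np.zeros_like(v)
    for t in range(seq_len):
        phi_k, phi_q = feature_map(k[t]), feature_map(q[t])
        S += np.outer(phi_k, v[t])
        z += phi_k
        linear_out = phi_q @ S / (phi_q @ z + 1e-6)

        # Exact softmax attention restricted to the last `window` tokens.
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        local_out = weights @ v[lo:t + 1]

        # Soft combination of the global linear branch and the local exact branch.
        outputs[t] = gate * linear_out + (1.0 - gate) * local_out
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    print(hybrid_attention(q, k, v).shape)  # (16, 8)
```

The sketch only shows why combining the two branches keeps memory constant while restoring sharp local interactions; the paper's high-order QK integration and multi-level vocabulary decomposition are not reproduced here.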

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ning25b, title = {{MVA}: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition}, author = {Ning, Wang and Li, Zekun and Bai, Tongxin and Yao, Man and Qin, Zhen and Li, Guoqi}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {46473--46497}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ning25b/ning25b.pdf}, url = {https://proceedings.mlr.press/v267/ning25b.html}, abstract = {Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive and exhibits significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a high-order QK integration theory and a multi-level vocabulary decomposition. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting compression loss from compressed states. Through detailed error analysis, we demonstrate superior approximation of Softmax attention achieved by our approach. To further improve performance and reduce training costs, we adopt a soft integration strategy with attention scores, effectively combining a sliding window mechanism. With less than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99% of their original performance. Compared to state-of-the-art linear attention model and method, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.} }
Endnote
%0 Conference Paper %T MVA: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition %A Wang Ning %A Zekun Li %A Tongxin Bai %A Man Yao %A Zhen Qin %A Guoqi Li %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-ning25b %I PMLR %P 46473--46497 %U https://proceedings.mlr.press/v267/ning25b.html %V 267 %X Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive and exhibits significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a high-order QK integration theory and a multi-level vocabulary decomposition. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting compression loss from compressed states. Through detailed error analysis, we demonstrate superior approximation of Softmax attention achieved by our approach. To further improve performance and reduce training costs, we adopt a soft integration strategy with attention scores, effectively combining a sliding window mechanism. With less than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99% of their original performance. Compared to state-of-the-art linear attention model and method, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.
APA
Ning, W., Li, Z., Bai, T., Yao, M., Qin, Z. & Li, G.. (2025). MVA: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:46473-46497 Available from https://proceedings.mlr.press/v267/ning25b.html.
