CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration

Haoyun Jiang, Haolin Li, Jianwei Zhang, Fei Huang, Qiang Hu, Minmin Sun, Shuai Xiao, Yong Li, Junyang Lin, Jiangchao Yao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:27569-27585, 2025.

Abstract

Large language models (LLMs) have demonstrated strong capabilities in handling long-context tasks, but processing such long contexts remains challenging due to the substantial memory requirements and inference latency. In this work, we discover that certain attention heads exhibit sequential consistency in their attention patterns, which can be persistently identified using a coefficient-of-variation-based algorithm. Inspired by this observation, we propose CateKV, a hybrid KV cache method that retains only critical token information for consistent heads, thereby reducing KV cache size and computational overhead, while preserving the majority of KV pairs in adaptive heads to ensure high accuracy. We show the unique characteristics of our algorithm and its extension with existing acceleration methods. Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to $2.72\times$ and accelerates decoding by $2.18\times$ in single-sample inputs, and boosts throughput by $3.96\times$ in batch scenarios.
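
To make the coefficient-of-variation idea from the abstract concrete, here is a minimal sketch of how heads might be split into "consistent" and "adaptive" groups. The observation window, the 0.5 threshold, the per-token averaging, and the function name classify_heads are illustrative assumptions, not the paper's actual CateKV procedure.

import torch

def classify_heads(attn_scores: torch.Tensor, cv_threshold: float = 0.5):
    """attn_scores: [num_steps, num_heads, seq_len] attention weights
    collected over a few observation steps (hypothetical setup)."""
    # Mean and std of each head's attention to every cached token across steps.
    mean = attn_scores.mean(dim=0)            # [num_heads, seq_len]
    std = attn_scores.std(dim=0)              # [num_heads, seq_len]
    # Per-head coefficient of variation, averaged over tokens.
    cv = (std / (mean + 1e-8)).mean(dim=-1)   # [num_heads]
    # Low variation across steps -> "consistent" head: keep only critical tokens.
    # High variation -> "adaptive" head: keep most of its KV cache.
    consistent = cv < cv_threshold
    return consistent, cv

# Example: 4 observation steps, 8 heads, 1024 cached tokens.
scores = torch.softmax(torch.randn(4, 8, 1024), dim=-1)
consistent_mask, cv = classify_heads(scores)
print(f"consistent heads: {consistent_mask.sum().item()} / {len(cv)}")

Under this sketch, the KV cache for heads flagged as consistent would be pruned to a small set of high-attention tokens, while adaptive heads retain most KV pairs, matching the hybrid scheme described in the abstract.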

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-jiang25e,
  title     = {{C}ate{KV}: On Sequential Consistency for Long-Context {LLM} Inference Acceleration},
  author    = {Jiang, Haoyun and Li, Haolin and Zhang, Jianwei and Huang, Fei and Hu, Qiang and Sun, Minmin and Xiao, Shuai and Li, Yong and Lin, Junyang and Yao, Jiangchao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {27569--27585},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/jiang25e/jiang25e.pdf},
  url       = {https://proceedings.mlr.press/v267/jiang25e.html}
}
Endnote
%0 Conference Paper
%T CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration
%A Haoyun Jiang
%A Haolin Li
%A Jianwei Zhang
%A Fei Huang
%A Qiang Hu
%A Minmin Sun
%A Shuai Xiao
%A Yong Li
%A Junyang Lin
%A Jiangchao Yao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-jiang25e
%I PMLR
%P 27569--27585
%U https://proceedings.mlr.press/v267/jiang25e.html
%V 267
APA
Jiang, H., Li, H., Zhang, J., Huang, F., Hu, Q., Sun, M., Xiao, S., Li, Y., Lin, J. & Yao, J. (2025). CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:27569-27585. Available from https://proceedings.mlr.press/v267/jiang25e.html.
