LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54892-54903, 2025.

Abstract

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache’s effectiveness in enhancing LLMs’ long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.
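
The abstract describes the two mechanisms only at a high level, so below is a minimal, illustrative Python sketch of one possible reading of the ladder-shaped retention pattern: a keep/drop mask over the (layer, token-position) grid in which recent tokens keep KV pairs at every layer while older tokens are retained at progressively fewer layers, extending the covered span under a fixed storage budget. All names and parameters here (ladder_keep_mask, recent_window, ladder_step) are hypothetical and are not taken from the paper or its released code, the direction of the ladder is an assumption, and the iterative, token-distance-based compaction mechanism is not sketched; see the linked repository for the authors' actual implementation.

# Illustrative sketch only (not the authors' implementation): a ladder-shaped
# retention mask over the (layer, token-position) grid. Recent tokens keep KV
# pairs at every layer; older tokens are kept only at shallower layers, so the
# retained span grows as the layer index decreases. All names and parameters
# are hypothetical.

def ladder_keep_mask(num_layers: int, seq_len: int,
                     recent_window: int, ladder_step: int) -> list[list[bool]]:
    """Return keep[layer][pos]: True if the KV pair of token `pos` is kept at `layer`."""
    keep = [[False] * seq_len for _ in range(num_layers)]
    for layer in range(num_layers):
        # In this sketch, deeper layers keep a shorter history and shallower
        # layers keep a longer one, forming a ladder under a fixed total budget.
        span = recent_window + ladder_step * (num_layers - 1 - layer)
        start = max(0, seq_len - span)
        for pos in range(start, seq_len):
            keep[layer][pos] = True
    return keep

if __name__ == "__main__":
    mask = ladder_keep_mask(num_layers=4, seq_len=12, recent_window=4, ladder_step=2)
    for layer, row in enumerate(mask):
        print(f"layer {layer}: " + "".join("#" if kept else "." for kept in row))

Running the demo prints a staircase of '#' characters, one row per layer, which is the "ladder" shape referred to in the title under the assumptions stated above.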

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-shi25b,
  title = {{L}a{C}ache: Ladder-Shaped {KV} Caching for Efficient Long-Context Modeling of Large Language Models},
  author = {Shi, Dachuan and Fu, Yonggan and Yuan, Xiangchi and Yu, Zhongzhi and You, Haoran and Li, Sixu and Dong, Xin and Kautz, Jan and Molchanov, Pavlo and Lin, Yingyan Celine},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {54892--54903},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shi25b/shi25b.pdf},
  url = {https://proceedings.mlr.press/v267/shi25b.html},
  abstract = {Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache’s effectiveness in enhancing LLMs’ long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.}
}
Endnote
%0 Conference Paper
%T LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
%A Dachuan Shi
%A Yonggan Fu
%A Xiangchi Yuan
%A Zhongzhi Yu
%A Haoran You
%A Sixu Li
%A Xin Dong
%A Jan Kautz
%A Pavlo Molchanov
%A Yingyan Celine Lin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shi25b
%I PMLR
%P 54892--54903
%U https://proceedings.mlr.press/v267/shi25b.html
%V 267
%X Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache’s effectiveness in enhancing LLMs’ long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.
APA
Shi, D., Fu, Y., Yuan, X., Yu, Z., You, H., Li, S., Dong, X., Kautz, J., Molchanov, P., & Lin, Y. C. (2025). LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54892-54903. Available from https://proceedings.mlr.press/v267/shi25b.html.
