Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11437-11452, 2024.

Abstract

Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient. Relevant code can be found at https://github.com/hdong920/LESS.
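To make the mechanism described in the abstract concrete, below is a minimal sketch (Python/NumPy) of the general idea: an eviction-based cache keeps a small budget of exact KV pairs, while evicted pairs are folded into a constant-sized recurrent state that later queries can still reach. Everything in the sketch is illustrative and assumed rather than taken from the paper: the fixed ReLU feature map phi, the oldest-first eviction rule, and the way the two attention terms are mixed are simplifications. LESS itself learns small kernel functions so that the constant-sized state approximates the attention residual discarded by an existing eviction policy (such as H2O), and it operates per attention head inside a transformer.

import numpy as np


def phi(x):
    # Hypothetical fixed non-negative feature map; LESS instead learns small kernels.
    return np.maximum(x, 0.0) + 1e-6


class LessLikeCache:
    # Toy single-head decoder cache: a small budget of exact KV pairs
    # (eviction simplified to oldest-first) plus a constant-sized state
    # that absorbs every evicted pair.

    def __init__(self, d, budget):
        self.budget = budget          # number of KV pairs kept exactly
        self.keys = np.empty((0, d))  # retained keys, one row per token
        self.vals = np.empty((0, d))  # retained values
        self.H = np.zeros((d, d))     # sum of outer(phi(k), v) over evicted pairs
        self.z = np.zeros(d)          # sum of phi(k) over evicted pairs

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.vals = np.vstack([self.vals, v])
        if len(self.keys) > self.budget:
            # Rather than discard the evicted pair, fold it into the
            # constant-sized state so later queries can still reach it.
            k_old, v_old = self.keys[0], self.vals[0]
            self.H += np.outer(phi(k_old), v_old)
            self.z += phi(k_old)
            self.keys, self.vals = self.keys[1:], self.vals[1:]

    def attend(self, q):
        # Exact exponential scores over retained pairs (stabilization omitted)...
        scores = np.exp(self.keys @ q)
        # ...plus a linear-attention-style contribution from the evicted pairs.
        num = scores @ self.vals + phi(q) @ self.H
        den = scores.sum() + phi(q) @ self.z
        return num / den


# Usage: stream 16 tokens through a budget-4 cache, then attend with a new query.
rng = np.random.default_rng(0)
cache = LessLikeCache(d=8, budget=4)
for _ in range(16):
    cache.append(rng.standard_normal(8), rng.standard_normal(8))
out = cache.attend(rng.standard_normal(8))  # shape (8,); reflects all 16 tokens

The property the sketch tries to show is that the cost of attend does not grow with the number of evicted tokens: H and z stay d-by-d and d regardless of sequence length, which is why the extra cache is "nearly free" while still letting every past token influence the output.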

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-dong24f,
  title     = {Get More with {LESS}: Synthesizing Recurrence with {KV} Cache Compression for Efficient {LLM} Inference},
  author    = {Dong, Harry and Yang, Xinyu and Zhang, Zhenyu and Wang, Zhangyang and Chi, Yuejie and Chen, Beidi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {11437--11452},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/dong24f/dong24f.pdf},
  url       = {https://proceedings.mlr.press/v235/dong24f.html},
  abstract  = {Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient. Relevant code can be found at https://github.com/hdong920/LESS.}
}
Endnote
%0 Conference Paper
%T Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
%A Harry Dong
%A Xinyu Yang
%A Zhenyu Zhang
%A Zhangyang Wang
%A Yuejie Chi
%A Beidi Chen
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-dong24f
%I PMLR
%P 11437--11452
%U https://proceedings.mlr.press/v235/dong24f.html
%V 235
%X Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient. Relevant code can be found at https://github.com/hdong920/LESS.
APA
Dong, H., Yang, X., Zhang, Z., Wang, Z., Chi, Y. & Chen, B. (2024). Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:11437-11452. Available from https://proceedings.mlr.press/v235/dong24f.html.