Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:30672-30687, 2025.

Abstract

We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated by sparse linear combinations of atoms from a small, input-agnostic dictionary of 4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low-memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
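To make the sparse-coding step concrete, here is a minimal sketch of orthogonal matching pursuit (OMP) against a fixed dictionary, as the abstract describes. This is not the authors' implementation: the dictionary D, the head dimension, the dictionary size, and the sparsity level s are all illustrative placeholders.

```python
import numpy as np

def omp(D, x, s):
    """Greedily approximate x as an s-sparse combination of columns (atoms) of D.

    D : (d, n_atoms) dictionary with unit-norm columns
    x : (d,) target vector (e.g., one key or value vector)
    s : number of atoms to select (sparsity level)
    Returns (support, coeffs) such that x ~= D[:, support] @ coeffs.
    """
    residual = x.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(s):
        # Greedy step: pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit coefficients on all selected atoms via least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    return support, coeffs

# Toy usage: recover a vector built from 3 atoms of a random unit-norm dictionary.
rng = np.random.default_rng(0)
d, n_atoms = 64, 4096  # head dimension and dictionary size chosen for illustration
D = rng.standard_normal((d, n_atoms))
D /= np.linalg.norm(D, axis=0)
true_idx = rng.choice(n_atoms, size=3, replace=False)
x = D[:, true_idx] @ rng.standard_normal(3)
support, coeffs = omp(D, x, s=3)
print(sorted(support), sorted(true_idx.tolist()))  # supports should coincide
```

Under this scheme, each cached vector is stored as s indices plus s coefficients instead of d dense values, which is how sparsity directly controls the compression ratio.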

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-kim25ae,
  title     = {Lexico: Extreme {KV} Cache Compression via Sparse Coding over Universal Dictionaries},
  author    = {Kim, Junhyuck and Park, Jongho and Cho, Jaewoong and Papailiopoulos, Dimitris},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {30672--30687},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/kim25ae/kim25ae.pdf},
  url       = {https://proceedings.mlr.press/v267/kim25ae.html},
  abstract  = {We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that key-value cache in modern LLMs can be accurately approximated using sparse linear combination from a small, input-agnostic dictionary of 4k atoms, enabling efficient compression across different input prompts, tasks and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.}
}
Endnote
%0 Conference Paper
%T Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
%A Junhyuck Kim
%A Jongho Park
%A Jaewoong Cho
%A Dimitris Papailiopoulos
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-kim25ae
%I PMLR
%P 30672--30687
%U https://proceedings.mlr.press/v267/kim25ae.html
%V 267
%X We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that key-value cache in modern LLMs can be accurately approximated using sparse linear combination from a small, input-agnostic dictionary of 4k atoms, enabling efficient compression across different input prompts, tasks and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
APA
Kim, J., Park, J., Cho, J. & Papailiopoulos, D. (2025). Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:30672-30687. Available from https://proceedings.mlr.press/v267/kim25ae.html.
