Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:55451-55473, 2025.

Abstract

Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality at higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) the existence of high-compression methods for internal network states (e.g., attention Keys & Values). We propose AQUA-KV, an adaptive quantization scheme for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under 1% relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
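The two-stage idea described in the abstract, a compact adapter predicts part of each Key/Value from information already in the cache and only the unpredictable residual is stored in quantized form, can be illustrated with a rough sketch. This is not the paper's implementation: the linear least-squares adapter, the uniform 2-bit per-channel quantizer, and every name, shape, and constant below are illustrative assumptions.

import torch

def fit_linear_adapter(x_calib, y_calib, ridge=1e-4):
    # One-shot calibration: ridge-regularized least squares mapping a "predictor"
    # cache tensor x (e.g., previous-layer keys) to a "target" cache tensor y.
    d = x_calib.shape[-1]
    gram = x_calib.T @ x_calib + ridge * torch.eye(d, dtype=x_calib.dtype)
    return torch.linalg.solve(gram, x_calib.T @ y_calib)  # weight of shape [d_in, d_out]

def quantize_residual(residual, bits=2):
    # Uniform per-channel quantization of whatever the adapter failed to predict.
    qmax = 2 ** bits - 1
    lo = residual.amin(dim=0, keepdim=True)
    scale = (residual.amax(dim=0, keepdim=True) - lo).clamp_min(1e-8) / qmax
    codes = torch.clamp(torch.round((residual - lo) / scale), 0, qmax).to(torch.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.to(scale.dtype) * scale + lo

# Toy usage on synthetic, correlated "caches" (stand-ins for real K/V activations):
torch.manual_seed(0)
prev_keys = torch.randn(4096, 128)                          # layer l-1 keys, [tokens, head_dim]
next_keys = 0.9 * prev_keys + 0.1 * torch.randn(4096, 128)  # layer l keys, correlated with l-1
W = fit_linear_adapter(prev_keys, next_keys)                # calibrated once, then frozen
residual = next_keys - prev_keys @ W                        # only this part is stored quantized
codes, scale, lo = quantize_residual(residual, bits=2)
reconstructed = prev_keys @ W + dequantize(codes, scale, lo)
print("relative error:", ((reconstructed - next_keys).norm() / next_keys.norm()).item())

The sketch only conveys the predict-then-quantize-the-residual structure; the actual choice of predictors, adapter form, and quantizer in AQUA-KV is specified in the paper.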

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-shutova25a,
  title     = {Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models},
  author    = {Shutova, Alina and Malinovskii, Vladimir and Egiazarian, Vage and Kuznedelev, Denis and Mazur, Denis and Nikita, Surkov and Ermakov, Ivan and Alistarh, Dan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {55451--55473},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shutova25a/shutova25a.pdf},
  url       = {https://proceedings.mlr.press/v267/shutova25a.html},
  abstract  = {Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) the existence of high-compression methods for internal network states (e.g. attention Keys & Values). We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.}
}
Endnote
%0 Conference Paper
%T Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
%A Alina Shutova
%A Vladimir Malinovskii
%A Vage Egiazarian
%A Denis Kuznedelev
%A Denis Mazur
%A Surkov Nikita
%A Ivan Ermakov
%A Dan Alistarh
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shutova25a
%I PMLR
%P 55451--55473
%U https://proceedings.mlr.press/v267/shutova25a.html
%V 267
%X Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) the existence of high-compression methods for internal network states (e.g. attention Keys & Values). We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
APA
Shutova, A., Malinovskii, V., Egiazarian, V., Kuznedelev, D., Mazur, D., Nikita, S., Ermakov, I. & Alistarh, D. (2025). Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:55451-55473. Available from https://proceedings.mlr.press/v267/shutova25a.html.
