Compression Barriers in Autoregressive Transformers

Themistoklis Haris, Krzysztof Onak
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:2757-2785, 2025.

Abstract

A key limitation of autoregressive Transformers is the large memory needed at inference time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but they often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d \geq \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(de^d)$ space and prove, using tight bounds on covering numbers, that \textsc{SubGen}, proposed by Zandieh, Han, Mirrokni, and Karbasi (2024), matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze the time complexity of token generation, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.
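To make the memory bottleneck concrete, the following minimal Python sketch (not taken from the paper; the names d, W_q, W_k, W_v, and decode_step are illustrative assumptions) decodes autoregressively with a single attention head and an uncompressed KV cache. After n generated tokens the cache holds 2nd floats, which is the $\Theta(nd)$ growth that the lower bound shows cannot be avoided in general without structural assumptions on the embeddings.

# Minimal sketch (illustrative only, not the paper's construction): single-head
# autoregressive attention decoding with an uncompressed KV cache. The cache
# stores one key and one value vector per generated token, so after n steps it
# holds 2 * n * d floats -- the Theta(n d) memory growth studied in the paper.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # dimension of the KV embeddings
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []              # each list grows by one d-dim vector per token

def decode_step(x):
    """One generation step: attend over all cached keys and values."""
    q, k, v = W_q @ x, W_k @ x, W_v @ x
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)   # shapes (n, d)
    scores = K @ q / np.sqrt(d)                    # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over the n cached tokens
    return weights @ V                             # attention output, shape (d,)

for n in range(1, 6):
    out = decode_step(rng.standard_normal(d))
    print(f"step {n}: KV cache stores {2 * n * d} floats")

Compression schemes such as \textsc{SubGen} replace K_cache and V_cache with a smaller summary; the results above bound how small such a summary can be.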

Cite this Paper

BibTeX
@InProceedings{pmlr-v291-haris25a,
  title     = {Compression Barriers in Autoregressive Transformers},
  author    = {Haris, Themistoklis and Onak, Krzysztof},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages     = {2757--2785},
  year      = {2025},
  editor    = {Haghtalab, Nika and Moitra, Ankur},
  volume    = {291},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--04 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/haris25a/haris25a.pdf},
  url       = {https://proceedings.mlr.press/v291/haris25a.html},
  abstract  = {A key limitation of autoregressive Transformers is the large memory needed at inference time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d \geq \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(de^d)$ space and prove, using tight bounds on covering numbers, that \textsc{SubGen}, proposed by Zandieh, Han, Mirrokni, and Karbasi (2024), matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation’s time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.}
}
Endnote
%0 Conference Paper
%T Compression Barriers in Autoregressive Transformers
%A Themistoklis Haris
%A Krzysztof Onak
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-haris25a
%I PMLR
%P 2757--2785
%U https://proceedings.mlr.press/v291/haris25a.html
%V 291
%X A key limitation of autoregressive Transformers is the large memory needed at inference time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d \geq \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(de^d)$ space and prove, using tight bounds on covering numbers, that \textsc{SubGen}, proposed by Zandieh, Han, Mirrokni, and Karbasi (2024), matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation’s time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.
APA
Haris, T. & Onak, K. (2025). Compression Barriers in Autoregressive Transformers. Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:2757-2785. Available from https://proceedings.mlr.press/v291/haris25a.html.
