Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:37396-37412, 2024.
Abstract
Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key–value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key–value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to $\sim 3.7 \times$ throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4$\times$ cache compression, outperforming up-trained grouped-query attention (GQA) and key–value eviction policies (H$_2$O, TOVA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget. We release the DMC code and models at https://github.com/NVIDIA/Megatron-LM/tree/DMC.
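
As a rough illustration of the memory pressure described in the abstract, the sketch below (not code from the released repository) estimates KV cache size as a function of batch size and sequence length, and shows the effect of a DMC-style compression ratio. The model dimensions are assumptions approximating Llama 2 7B, and `compression_ratio` is a hypothetical parameter standing in for the learned per-head, per-layer compression.

```python
# Back-of-the-envelope KV cache size estimate (assumed Llama 2 7B-like dimensions).
# The cache grows linearly with batch size and sequence length; a compression
# ratio > 1 models a DMC-style reduction of the number of cached key-value pairs.

def kv_cache_bytes(batch_size, seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2, compression_ratio=1.0):
    """Bytes needed to store keys and values for all past tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # keys + values
    return batch_size * seq_len * per_token / compression_ratio


if __name__ == "__main__":
    base = kv_cache_bytes(batch_size=8, seq_len=4096)
    dmc = kv_cache_bytes(batch_size=8, seq_len=4096, compression_ratio=4.0)
    print(f"uncompressed KV cache: {base / 2**30:.1f} GiB")  # ~16.0 GiB
    print(f"4x compressed cache:   {dmc / 2**30:.1f} GiB")   # ~4.0 GiB
```

Under these assumed dimensions, a 4$\times$ cache compression frees roughly three quarters of the KV cache memory, which is what allows longer contexts or larger batches to fit within the same budget.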