CaM: Cache Merging for Memory-efficient LLMs Inference

Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:58840-58850, 2024.

Abstract

Despite the exceptional performance of Large Language Models (LLMs), the substantial volume of key-value (KV) pairs cached during inference presents a barrier to their efficient deployment. To ameliorate this, recent works have aimed to selectively eliminate these caches, informed by the attention scores of associated tokens. However, such cache eviction invariably leads to output perturbation, regardless of the token choice. This perturbation escalates with the compression ratio, which can precipitate a marked deterioration in LLM inference performance. This paper introduces Cache Merging (CaM) as a solution to mitigate this challenge. CaM adaptively merges to-be-evicted caches into the remaining ones, employing a novel sampling strategy governed by the prominence of attention scores within discarded locations. In this manner, CaM enables memory-efficient LLMs to preserve critical token information, even obviating the need to maintain their corresponding caches. Extensive experiments utilizing LLaMA, OPT, and GPT-NeoX across various benchmarks corroborate CaM’s proficiency in bolstering the performance of memory-efficient LLMs. Code is released at https://github.com/zyxxmu/cam.
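
Example (illustrative only). The abstract describes the merging mechanism only at a high level, so the short PyTorch sketch below shows one way attention-guided eviction-plus-merging could look for a single attention head. The function name cam_style_compress, the choice of the nearest retained entry as the merge target, and the relative-attention merge weight are assumptions made for this sketch, not the authors' algorithm; the released code at the repository above is the reference implementation.

import torch

def cam_style_compress(keys, values, attn_scores, budget, generator=None):
    """Hypothetical sketch of attention-guided cache merging for one head.
    keys, values: [seq_len, head_dim]; attn_scores: [seq_len] accumulated
    (nonnegative) attention mass per cached token; budget: entries to keep."""
    seq_len = keys.size(0)
    if seq_len <= budget:
        return keys, values

    # Keep the tokens with the highest accumulated attention; mark the rest for eviction.
    keep_idx = torch.topk(attn_scores, budget).indices.sort().values
    evict_mask = torch.ones(seq_len, dtype=torch.bool)
    evict_mask[keep_idx] = False
    evict_idx = torch.nonzero(evict_mask, as_tuple=False).squeeze(-1)

    new_keys = keys[keep_idx].clone()
    new_values = values[keep_idx].clone()

    # Sample which evicted entries get merged: the more prominent an evicted
    # token's attention score relative to the other evicted tokens, the more
    # likely its cache is folded into a retained entry rather than dropped.
    evict_scores = attn_scores[evict_idx]
    probs = evict_scores / (evict_scores.max() + 1e-8)
    sampled = torch.bernoulli(probs, generator=generator).bool()

    for i in torch.nonzero(sampled, as_tuple=False).squeeze(-1).tolist():
        src = evict_idx[i]
        # Merge target and weight are assumptions of this sketch: fold the evicted
        # value into the nearest retained position, weighted by relative attention.
        dst = torch.argmin((keep_idx - src).abs())
        w = attn_scores[src] / (attn_scores[src] + attn_scores[keep_idx[dst]] + 1e-8)
        new_values[dst] = (1 - w) * new_values[dst] + w * values[src]

    return new_keys, new_values

In a decoding loop, a routine like this would be applied per head whenever the cache exceeds its budget, with attn_scores maintained as a running sum of the attention each cached token receives.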

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-zhang24n,
  title     = {{C}a{M}: Cache Merging for Memory-efficient {LLM}s Inference},
  author    = {Zhang, Yuxin and Du, Yuxuan and Luo, Gen and Zhong, Yunshan and Zhang, Zhenyu and Liu, Shiwei and Ji, Rongrong},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {58840--58850},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhang24n/zhang24n.pdf},
  url       = {https://proceedings.mlr.press/v235/zhang24n.html},
  abstract  = {Despite the exceptional performance of Large Language Models (LLMs), the substantial volume of key-value (KV) pairs cached during inference presents a barrier to their efficient deployment. To ameliorate this, recent works have aimed to selectively eliminate these caches, informed by the attention scores of associated tokens. However, such cache eviction invariably leads to output perturbation, regardless of the token choice. This perturbation escalates with the compression ratio, which can precipitate a marked deterioration in LLM inference performance. This paper introduces Cache Merging (CaM) as a solution to mitigate this challenge. CaM adaptively merges to-be-evicted caches into the remaining ones, employing a novel sampling strategy governed by the prominence of attention scores within discarded locations. In this manner, CaM enables memory-efficient LLMs to preserve critical token information, even obviating the need to maintain their corresponding caches. Extensive experiments utilizing LLaMA, OPT, and GPT-NeoX across various benchmarks corroborate CaM’s proficiency in bolstering the performance of memory-efficient LLMs. Code is released at https://github.com/zyxxmu/cam.}
}
Endnote
%0 Conference Paper
%T CaM: Cache Merging for Memory-efficient LLMs Inference
%A Yuxin Zhang
%A Yuxuan Du
%A Gen Luo
%A Yunshan Zhong
%A Zhenyu Zhang
%A Shiwei Liu
%A Rongrong Ji
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zhang24n
%I PMLR
%P 58840--58850
%U https://proceedings.mlr.press/v235/zhang24n.html
%V 235
%X Despite the exceptional performance of Large Language Models (LLMs), the substantial volume of key-value (KV) pairs cached during inference presents a barrier to their efficient deployment. To ameliorate this, recent works have aimed to selectively eliminate these caches, informed by the attention scores of associated tokens. However, such cache eviction invariably leads to output perturbation, regardless of the token choice. This perturbation escalates with the compression ratio, which can precipitate a marked deterioration in LLM inference performance. This paper introduces Cache Merging (CaM) as a solution to mitigate this challenge. CaM adaptively merges to-be-evicted caches into the remaining ones, employing a novel sampling strategy governed by the prominence of attention scores within discarded locations. In this manner, CaM enables memory-efficient LLMs to preserve critical token information, even obviating the need to maintain their corresponding caches. Extensive experiments utilizing LLaMA, OPT, and GPT-NeoX across various benchmarks corroborate CaM’s proficiency in bolstering the performance of memory-efficient LLMs. Code is released at https://github.com/zyxxmu/cam.
APA
Zhang, Y., Du, Y., Luo, G., Zhong, Y., Zhang, Z., Liu, S., & Ji, R. (2024). CaM: Cache Merging for Memory-efficient LLMs Inference. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:58840-58850. Available from https://proceedings.mlr.press/v235/zhang24n.html.
