EPIC: Efficient Position-Independent Caching for Serving Large Language Models

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Zhang Qin, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:24391-24402, 2025.

Abstract

Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate “attention sink” effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8$\times$ improvements in Time-To-First-Token (TTFT) and 7$\times$ throughput gains over existing systems, with negligible or no accuracy loss.
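
To make the contrast in the abstract concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of the lookup logic behind the two caching schemes: a conventional prefix cache can reuse KV vectors only when a request begins with an exact, previously seen token prefix, whereas a position-independent cache keys KV vectors by the immutable chunk itself (e.g., a document), so the same cached entry can be reused under any prefix. All class and method names here are illustrative assumptions.

# Illustrative sketch only; names (PrefixKVCache, ChunkKVCache) are hypothetical.
from typing import Dict, List, Optional, Tuple


class PrefixKVCache:
    """Exact-prefix context caching: reuse KV only when a request starts with an
    identical, previously seen token prefix."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], object] = {}

    def put(self, prefix: List[int], kv: object) -> None:
        self._store[tuple(prefix)] = kv

    def lookup(self, prompt: List[int]) -> Optional[object]:
        # Try the longest candidate prefix first; hit only on an exact prefix match.
        for end in range(len(prompt), 0, -1):
            kv = self._store.get(tuple(prompt[:end]))
            if kv is not None:
                return kv
        return None


class ChunkKVCache:
    """Position-independent caching (PIC) style: KV for an immutable chunk
    (e.g., a document) is keyed by the chunk's own tokens, so it can be reused
    regardless of what prefix precedes it in a given request."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], object] = {}

    def put(self, chunk: List[int], kv: object) -> None:
        self._store[tuple(chunk)] = kv

    def lookup(self, chunk: List[int]) -> Optional[object]:
        return self._store.get(tuple(chunk))


if __name__ == "__main__":
    doc = [101, 102, 103]  # token ids of an immutable document chunk
    pic = ChunkKVCache()
    pic.put(doc, kv="cached-kv-for-doc")
    # Reusable even though different requests place different prefixes before it.
    assert pic.lookup(doc) is not None

Note that naive reuse as sketched above would hurt accuracy, because each cached chunk was computed as if it began a sequence; per the abstract, LegoLink is the paper's mechanism for mitigating this "attention sink" effect at every document beginning with minimal extra computation.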

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-hu25j,
  title     = {{EPIC}: Efficient Position-Independent Caching for Serving Large Language Models},
  author    = {Hu, Junhao and Huang, Wenrui and Wang, Weidong and Wang, Haoyi and Hu, Tiancheng and Qin, Zhang and Feng, Hao and Chen, Xusheng and Shan, Yizhou and Xie, Tao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {24391--24402},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hu25j/hu25j.pdf},
  url       = {https://proceedings.mlr.press/v267/hu25j.html}
}
Endnote
%0 Conference Paper
%T EPIC: Efficient Position-Independent Caching for Serving Large Language Models
%A Junhao Hu
%A Wenrui Huang
%A Weidong Wang
%A Haoyi Wang
%A Tiancheng Hu
%A Zhang Qin
%A Hao Feng
%A Xusheng Chen
%A Yizhou Shan
%A Tao Xie
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hu25j
%I PMLR
%P 24391--24402
%U https://proceedings.mlr.press/v267/hu25j.html
%V 267
APA
Hu, J., Huang, W., Wang, W., Wang, H., Hu, T., Qin, Z., Feng, H., Chen, X., Shan, Y. & Xie, T. (2025). EPIC: Efficient Position-Independent Caching for Serving Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:24391-24402. Available from https://proceedings.mlr.press/v267/hu25j.html.