RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8093-8107, 2025.

Abstract

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG both to accelerate long-context inference and to enhance generation quality. RAPID introduces the RAG drafter, a draft LLM operating on a shortened retrieval context, to speculate on the generation of a long-context target LLM. This approach enables a new paradigm in which same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer that enriches the target distribution with retrieval-augmented knowledge. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2$\times$ speedups for long-context inference. Our analyses also reveal the robustness of RAPID across various context lengths and retrieval quality.
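
To make the draft-then-verify mechanism concrete, here is a minimal, self-contained Python sketch of speculative decoding with a retrieval-augmented drafter, in the spirit of the abstract. Toy hash-seeded categorical distributions stand in for real LLMs, the names (retrieve, draft_dist, target_dist, GAMMA) are illustrative assumptions rather than the authors' API, and the paper's inference-time knowledge-transfer step is omitted; only the standard speculative-sampling accept/reject rule is reproduced faithfully.

import numpy as np

VOCAB = 16   # toy vocabulary size
GAMMA = 4    # number of tokens the drafter speculates per round
rng = np.random.default_rng(0)

def retrieve(long_context, window=64):
    # Stand-in retriever: the drafter sees only a short slice of the context.
    return long_context[-window:]

def _toy_dist(prefix, salt):
    # Deterministic toy categorical distribution over the vocabulary,
    # standing in for an LLM's next-token distribution at this prefix.
    g = np.random.default_rng((hash(tuple(prefix)) + salt) % (2**32))
    logits = g.standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_dist(retrieved, prefix):
    # p(. | short retrieved context, prefix): the cheap RAG drafter.
    return _toy_dist(tuple(retrieved) + tuple(prefix), salt=0)

def target_dist(long_context, prefix):
    # q(. | full long context, prefix): the expensive target LLM.
    return _toy_dist(tuple(long_context) + tuple(prefix), salt=1)

def speculate_round(long_context, prefix):
    """One draft-then-verify round; returns the extended prefix."""
    retrieved = retrieve(long_context)

    # 1) Drafter speculates GAMMA tokens from the short retrieved context.
    cur, drafts, draft_ps = list(prefix), [], []
    for _ in range(GAMMA):
        p = draft_dist(retrieved, cur)
        tok = int(rng.choice(VOCAB, p=p))
        drafts.append(tok)
        draft_ps.append(p)
        cur.append(tok)

    # 2) Target verifies the drafts (one parallel forward pass in practice).
    out = list(prefix)
    for tok, p in zip(drafts, draft_ps):
        q = target_dist(long_context, out)
        if rng.random() < min(1.0, q[tok] / p[tok]):
            out.append(tok)                 # accept with prob min(1, q/p)
        else:
            resid = np.maximum(q - p, 0.0)  # reject: resample from the
            resid /= resid.sum()            # renormalized residual
            out.append(int(rng.choice(VOCAB, p=resid)))
            return out

    # 3) All drafts accepted: take one bonus token from the target.
    q = target_dist(long_context, out)
    out.append(int(rng.choice(VOCAB, p=q)))
    return out

if __name__ == "__main__":
    ctx = list(range(200))   # stand-in "long document" token ids
    print(speculate_round(ctx, [1, 2, 3]))

Because the accept/reject rule preserves the target distribution exactly, the speedup comes purely from the drafter's cheap short-context passes; in RAPID the drafter can be the same scale as (or larger than) the target, since its retrieved context, and hence its KV cache, stays small.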

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-chen25s,
  title     = {{RAPID}: Long-Context Inference with Retrieval-Augmented Speculative Decoding},
  author    = {Chen, Guanzheng and Feng, Qilong and Ni, Jinjie and Li, Xin and Shieh, Michael Qizhe},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {8093--8107},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25s/chen25s.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25s.html}
}
Endnote
%0 Conference Paper
%T RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
%A Guanzheng Chen
%A Qilong Feng
%A Jinjie Ni
%A Xin Li
%A Michael Qizhe Shieh
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25s
%I PMLR
%P 8093--8107
%U https://proceedings.mlr.press/v267/chen25s.html
%V 267
APA
Chen, G., Feng, Q., Ni, J., Li, X. & Shieh, M. Q. (2025). RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8093-8107. Available from https://proceedings.mlr.press/v267/chen25s.html.
