Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:60626-60643, 2024.
Abstract
This paper introduces RaLMSpec, a framework that accelerates iterative retrieval-augmented language model (RaLM) serving with speculative retrieval and batched verification. RaLMSpec further introduces several important systems optimizations, including prefetching, an optimal speculation stride scheduler, and asynchronous verification. The combination of these techniques allows RaLMSpec to significantly outperform existing systems. For document-level iterative RaLM serving, evaluation over three LLMs on four QA datasets shows that RaLMSpec improves over existing approaches by $1.75$-$2.39\times$, $1.04$-$1.39\times$, and $1.31$-$1.77\times$ when the retriever is an exact dense retriever, an approximate dense retriever, and a sparse retriever, respectively. For token-level iterative RaLM (KNN-LM) serving, RaLMSpec is up to $7.59\times$ and $2.45\times$ faster than existing methods for exact dense and approximate dense retrievers, respectively.
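To make the speculate-then-verify idea concrete, below is a minimal Python sketch of a speculative retrieval loop with batched verification and rollback. All names here (`make_query`, `decode_step`, `cache_retrieve`, `batch_retrieve`, the fixed `stride`) are illustrative assumptions rather than the paper's implementation; in particular, RaLMSpec schedules the speculation stride adaptively, whereas this sketch fixes it.

```python
from typing import Callable, List, Tuple

def speculative_retrieval_loop(
    make_query: Callable[[List[str]], str],       # builds a retrieval query from output so far
    decode_step: Callable[[str, str], str],       # one LM decoding step given (query, document)
    cache_retrieve: Callable[[str], str],         # cheap speculative retriever (e.g., local cache)
    batch_retrieve: Callable[[List[str]], List[str]],  # expensive retriever, batched
    stride: int = 4,
    max_steps: int = 32,
) -> List[str]:
    """Sketch: speculate `stride` retrieval steps from a cheap retriever,
    then verify all speculated documents with one batched call to the
    expensive retriever, rolling back to the first mismatch if any."""
    output: List[str] = []
    while len(output) < max_steps:
        base = len(output)
        speculated: List[Tuple[str, str]] = []
        # Speculation phase: retrieve cheaply and keep decoding.
        for _ in range(min(stride, max_steps - base)):
            q = make_query(output)
            guess = cache_retrieve(q)  # cheap, possibly stale
            output.append(decode_step(q, guess))
            speculated.append((q, guess))
        # Verification phase: one batched call to the expensive retriever.
        truths = batch_retrieve([q for q, _ in speculated])
        for i, ((q, guess), truth) in enumerate(zip(speculated, truths)):
            if guess != truth:
                # The query at the mismatch is still valid (it depends only
                # on the prefix); redo that step with the verified document
                # and discard everything speculated after it.
                del output[base + i:]
                output.append(decode_step(q, truth))
                break
    return output
```

Under these assumptions, the speedup comes from replacing `stride` sequential calls to the expensive retriever with a single batched call, while a mis-speculation costs only the rolled-back suffix of the stride.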