LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs – No Silver Bullet for LC or RAG Routing
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:36846-36867, 2025.
Abstract
As the context windows of Large Language Models (LLMs) expand, whether Retrieval-Augmented Generation (RAG) is still necessary for integrating external knowledge is under debate. Existing comparisons of RAG and long-context (LC) LLMs are often inconclusive due to benchmark limitations. We introduce LaRA, a novel benchmark with 2326 test cases across four QA tasks and three types of long context, enabling rigorous evaluation. Our analysis of eleven LLMs reveals that the optimal choice between RAG and LC depends on a complex interplay of model capabilities, context length, task type, and retrieval characteristics, and we distill these findings into actionable guidelines for practitioners. Our code and dataset are provided at: https://github.com/Alibaba-NLP/LaRA