RAGGED: Towards Informed Design of Scalable and Stable RAG Systems

Jennifer Hsia, Afreen Shaikh, Zora Zhiruo Wang, Graham Neubig
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:24139-24155, 2025.

Abstract

Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge, but its effectiveness is highly dependent on system configuration. Improper retrieval settings can degrade performance, making RAG less reliable than closed-book generation. In this work, we introduce RAGGED, a framework for systematically evaluating RAG systems across diverse retriever-reader configurations, retrieval depths, and datasets. Our analysis reveals that reader robustness to noise is the key determinant of RAG stability and scalability. Some readers benefit from increased retrieval depth, while others degrade due to their sensitivity to distracting content. Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. By providing a principled framework and new metrics to assess RAG stability and scalability, RAGGED enables systematic evaluation of retrieval-augmented generation systems, guiding future research on optimizing retrieval depth and model robustness.
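
To make the setup concrete, here is a minimal, hypothetical sketch (in Python; not the authors' released RAGGED code) of the experiment the abstract describes: fix a retriever-reader pair, sweep the retrieval depth k, and compare each depth against a closed-book baseline to see whether the reader scales with added context or degrades from noise. The retrieve and read callables and the toy stand-ins below are illustrative assumptions.

# Hypothetical sketch of a retrieval-depth sweep (not the RAGGED codebase).
from typing import Callable, Sequence

def sweep_retrieval_depth(
    question: str,
    gold_answer: str,
    retrieve: Callable[[str, int], list],  # (query, k) -> top-k passages
    read: Callable[[str, list], str],      # (query, passages) -> answer string
    depths: Sequence[int] = (1, 5, 10, 20, 50),
) -> dict:
    """Score one question at several retrieval depths; k=0 is the closed-book baseline."""
    scores = {0: float(gold_answer.lower() in read(question, []).lower())}
    for k in depths:
        scores[k] = float(gold_answer.lower() in read(question, retrieve(question, k)).lower())
    return scores

# Toy stand-ins so the sketch runs end to end; real experiments would plug in
# retrievers such as BM25 or ColBERT and LLM readers, as studied in the paper.
corpus = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

def toy_retrieve(query: str, k: int) -> list:
    # Rank passages by word overlap with the query and return the top k.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))[:k]

def toy_read(query: str, passages: list) -> str:
    # Stub reader: answer from the top passage, or abstain when closed-book.
    return passages[0] if passages else "unknown"

print(sweep_retrieval_depth("What is the capital of France?", "Paris", toy_retrieve, toy_read))
# -> {0: 0.0, 1: 1.0, 5: 1.0, 10: 1.0, 20: 1.0, 50: 1.0}

In the paper's terms, roughly: a reader whose scores keep improving with k is scalable, while one whose scores flatten or drop past a small k is sensitive to the noise that deeper retrieval introduces.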

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-hsia25a,
  title     = {{RAGGED}: Towards Informed Design of Scalable and Stable {RAG} Systems},
  author    = {Hsia, Jennifer and Shaikh, Afreen and Wang, Zora Zhiruo and Neubig, Graham},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {24139--24155},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hsia25a/hsia25a.pdf},
  url       = {https://proceedings.mlr.press/v267/hsia25a.html}
}
Endnote
%0 Conference Paper
%T RAGGED: Towards Informed Design of Scalable and Stable RAG Systems
%A Jennifer Hsia
%A Afreen Shaikh
%A Zora Zhiruo Wang
%A Graham Neubig
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hsia25a
%I PMLR
%P 24139--24155
%U https://proceedings.mlr.press/v267/hsia25a.html
%V 267
APA
Hsia, J., Shaikh, A., Wang, Z.Z. & Neubig, G. (2025). RAGGED: Towards Informed Design of Scalable and Stable RAG Systems. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:24139-24155. Available from https://proceedings.mlr.press/v267/hsia25a.html.
