Does Domain-Specific Retrieval Augmented Generation Help LLMs Answer Consumer Health Questions?

Chase M Fensore, Rodrigo M Carrillo-Larco, Megha Shah, Joyce C. Ho
Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025.

Abstract

While large language models (LLMs) have shown impressive performance on medical benchmarks, there remains uncertainty about whether retrieval-augmented generation (RAG) meaningfully improves their ability to answer consumer health questions. In this study, we systematically evaluate vanilla LLMs against RAG-enhanced approaches using the NIDDK portion of the MedQuAD dataset. We compare four open-source LLMs in both vanilla and RAG configurations, assessing performance through automated metrics, LLM-based evaluation, and clinical validation. Surprisingly, we find that vanilla LLM approaches consistently outperform RAG variants across both quantitative metrics (BLEU, ROUGE, BERTScore) and qualitative assessments. The relatively low retrieval performance (Precision@5 = 0.15) highlights fundamental challenges in implementing effective RAG systems for medical question-answering, even with carefully curated questions. While RAG showed competitive performance in specific areas like scientific consensus and harm reduction, our findings suggest that successful implementation of RAG for consumer health question-answering requires more sophisticated approaches than simple retrieval and prompt engineering. These results contribute to the ongoing discussion about the role of retrieval augmentation in medical AI systems and highlight the need for medical-specific RAG infrastructure to enhance medical question-answering systems.
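As a point of reference for the retrieval result cited above (Precision@5 = 0.15), the sketch below shows how Precision@k is commonly computed when evaluating the retriever of a RAG pipeline. This is an illustrative example only, not the authors' evaluation code; the function names and data structures are assumptions made for the sketch.

```python
from typing import Iterable, List, Set, Tuple

def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are relevant to the query."""
    if k <= 0 or not retrieved_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def mean_precision_at_k(runs: Iterable[Tuple[List[str], Set[str]]], k: int = 5) -> float:
    """Average Precision@k over queries; each run is (ranked retrieved IDs, gold relevant IDs)."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a mean score of 0.15 means that, on average, fewer than one of the top five retrieved passages is relevant, which is consistent with the paper's observation that weak retrieval can leave RAG at a disadvantage relative to a vanilla LLM.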

Cite this Paper


BibTeX
@InProceedings{pmlr-v298-fensore25a,
  title = {Does Domain-Specific Retrieval Augmented Generation Help {LLM}s Answer Consumer Health Questions?},
  author = {Fensore, Chase M and Carrillo-Larco, Rodrigo M and Shah, Megha and Ho, Joyce C.},
  booktitle = {Proceedings of the 10th Machine Learning for Healthcare Conference},
  year = {2025},
  editor = {Agrawal, Monica and Deshpande, Kaivalya and Engelhard, Matthew and Joshi, Shalmali and Tang, Shengpu and Urteaga, Iñigo},
  volume = {298},
  series = {Proceedings of Machine Learning Research},
  month = {15--16 Aug},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v298/main/assets/fensore25a/fensore25a.pdf},
  url = {https://proceedings.mlr.press/v298/fensore25a.html},
  abstract = {While large language models (LLMs) have shown impressive performance on medical benchmarks, there remains uncertainty about whether retrieval-augmented generation (RAG) meaningfully improves their ability to answer consumer health questions. In this study, we systematically evaluate vanilla LLMs against RAG-enhanced approaches using the NIDDK portion of the MedQuAD dataset. We compare four open-source LLMs in both vanilla and RAG configurations, assessing performance through automated metrics, LLM-based evaluation, and clinical validation. Surprisingly, we find that vanilla LLM approaches consistently outperform RAG variants across both quantitative metrics (BLEU, ROUGE, BERTScore) and qualitative assessments. The relatively low retrieval performance (Precision@5 = 0.15) highlights fundamental challenges in implementing effective RAG systems for medical question-answering, even with carefully curated questions. While RAG showed competitive performance in specific areas like scientific consensus and harm reduction, our findings suggest that successful implementation of RAG for consumer health question-answering requires more sophisticated approaches than simple retrieval and prompt engineering. These results contribute to the ongoing discussion about the role of retrieval augmentation in medical AI systems and highlight the need for medical-specific RAG infrastructure to enhance medical question-answering systems.}
}
Endnote
%0 Conference Paper
%T Does Domain-Specific Retrieval Augmented Generation Help LLMs Answer Consumer Health Questions?
%A Chase M Fensore
%A Rodrigo M Carrillo-Larco
%A Megha Shah
%A Joyce C. Ho
%B Proceedings of the 10th Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Monica Agrawal
%E Kaivalya Deshpande
%E Matthew Engelhard
%E Shalmali Joshi
%E Shengpu Tang
%E Iñigo Urteaga
%F pmlr-v298-fensore25a
%I PMLR
%U https://proceedings.mlr.press/v298/fensore25a.html
%V 298
%X While large language models (LLMs) have shown impressive performance on medical benchmarks, there remains uncertainty about whether retrieval-augmented generation (RAG) meaningfully improves their ability to answer consumer health questions. In this study, we systematically evaluate vanilla LLMs against RAG-enhanced approaches using the NIDDK portion of the MedQuAD dataset. We compare four open-source LLMs in both vanilla and RAG configurations, assessing performance through automated metrics, LLM-based evaluation, and clinical validation. Surprisingly, we find that vanilla LLM approaches consistently outperform RAG variants across both quantitative metrics (BLEU, ROUGE, BERTScore) and qualitative assessments. The relatively low retrieval performance (Precision@5 = 0.15) highlights fundamental challenges in implementing effective RAG systems for medical question-answering, even with carefully curated questions. While RAG showed competitive performance in specific areas like scientific consensus and harm reduction, our findings suggest that successful implementation of RAG for consumer health question-answering requires more sophisticated approaches than simple retrieval and prompt engineering. These results contribute to the ongoing discussion about the role of retrieval augmentation in medical AI systems and highlight the need for medical-specific RAG infrastructure to enhance medical question-answering systems.
APA
Fensore, C.M., Carrillo-Larco, R.M., Shah, M. & Ho, J.C. (2025). Does Domain-Specific Retrieval Augmented Generation Help LLMs Answer Consumer Health Questions? Proceedings of the 10th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 298. Available from https://proceedings.mlr.press/v298/fensore25a.html.
