MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation

Mahsa Yarmohammadi, Alexandra DeLucia, Lillian C. Chen, Leslie Miller, Heyuan Huang, Sonal Joshi, Jonathan Lasko, Sarah Collica, Ryan Moore, Haoling Qiu, Peter P. Zandi, Damianos Karakos, Mark Dredze
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1516-1561, 2026.

Abstract

Large language models (LLMs) can create compelling patient-facing medical chatbots, but their reliability in clinical settings remains a concern due to the accuracy of their responses. To better evaluate patient-facing LLM generations, we introduce MedExpert, a comprehensive dataset featuring clinician-created questions and annotations to assess the accuracy and reliability of LLM-generated medical responses. MedExpert comprises 540 question–response pairs in two specialties—young adult mental health and prenatal care—each annotated by clinical subject-matter experts for aspects such as factual accuracy and completeness. The dataset provides a framework for exploring these issues in medical chatbots and for evaluating automatic error detection systems in these domains.

Cite this Paper

BibTeX
@InProceedings{pmlr-v297-yarmohammadi26a,
  title     = {{MedExpert}: An Expert-Annotated Dataset for Medical Chatbot Evaluation},
  author    = {Yarmohammadi, Mahsa and DeLucia, Alexandra and Chen, Lillian C. and Miller, Leslie and Huang, Heyuan and Joshi, Sonal and Lasko, Jonathan and Collica, Sarah and Moore, Ryan and Qiu, Haoling and Zandi, Peter P. and Karakos, Damianos and Dredze, Mark},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {1516--1561},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/yarmohammadi26a/yarmohammadi26a.pdf},
  url       = {https://proceedings.mlr.press/v297/yarmohammadi26a.html},
  abstract  = {Large language models ({LLM}s) can create compelling patient-facing medical chatbots, but their reliability in clinical settings remains a concern due to the accuracy of their responses. To better evaluate patient-facing {LLM} generations, we introduce MedExpert, a comprehensive dataset featuring clinician-created questions and annotations to assess the accuracy and reliability of {LLM}-generated medical responses. MedExpert comprises 540 question–response pairs in two specialties—young adult mental health and prenatal care—each annotated by clinical subject-matter experts for aspects such as factual accuracy and completeness. The dataset provides a framework for exploring these issues in medical chatbots, and to evaluate automatic error detection systems in these domains.}
}
Endnote
%0 Conference Paper
%T MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation
%A Mahsa Yarmohammadi
%A Alexandra DeLucia
%A Lillian C. Chen
%A Leslie Miller
%A Heyuan Huang
%A Sonal Joshi
%A Jonathan Lasko
%A Sarah Collica
%A Ryan Moore
%A Haoling Qiu
%A Peter P. Zandi
%A Damianos Karakos
%A Mark Dredze
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-yarmohammadi26a
%I PMLR
%P 1516--1561
%U https://proceedings.mlr.press/v297/yarmohammadi26a.html
%V 297
%X Large language models (LLMs) can create compelling patient-facing medical chatbots, but their reliability in clinical settings remains a concern due to the accuracy of their responses. To better evaluate patient-facing LLM generations, we introduce MedExpert, a comprehensive dataset featuring clinician-created questions and annotations to assess the accuracy and reliability of LLM-generated medical responses. MedExpert comprises 540 question–response pairs in two specialties—young adult mental health and prenatal care—each annotated by clinical subject-matter experts for aspects such as factual accuracy and completeness. The dataset provides a framework for exploring these issues in medical chatbots and for evaluating automatic error detection systems in these domains.
APA
Yarmohammadi, M., DeLucia, A., Chen, L.C., Miller, L., Huang, H., Joshi, S., Lasko, J., Collica, S., Moore, R., Qiu, H., Zandi, P.P., Karakos, D. & Dredze, M. (2026). MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1516-1561. Available from https://proceedings.mlr.press/v297/yarmohammadi26a.html.
