MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1516-1561, 2026.
Abstract
Large language models (LLMs) can power compelling patient-facing medical chatbots, but the accuracy of their responses remains a concern for reliability in clinical settings. To better evaluate patient-facing LLM generations, we introduce MedExpert, a comprehensive dataset of clinician-created questions and annotations for assessing the accuracy and reliability of LLM-generated medical responses. MedExpert comprises 540 question–response pairs in two specialties (young adult mental health and prenatal care), each annotated by clinical subject-matter experts for aspects such as factual accuracy and completeness. The dataset provides a framework for exploring these issues in medical chatbots and for evaluating automatic error detection systems in these domains.