Evaluating Safety of Large Language Models for Patient-facing Medical Question Answering

Yella Diekmann, Chase M Fensore, Rodrigo M Carrillo-Larco, Nishant Pradhan, Bhavya Appana, Joyce C Ho
Proceedings of the 4th Machine Learning for Health Symposium, PMLR 259:267-290, 2025.

Abstract

Large language models (LLMs) have revolutionized the question answering (QA) domain by achieving near-human performance across a broad range of tasks. Recent studies have suggested LLMs are capable of answering clinical questions and providing medical advice. Although LLMs’ answers must be reliable and safe, existing evaluations of medical QA systems often focus only on the accuracy of the content. However, a critical, underexplored aspect is whether variations in patient inquiries, i.e., rephrasings of the same question, lead to inconsistent or unsafe LLM responses. We propose a new evaluation methodology leveraging synthetic question generation to rigorously assess the safety of LLMs in patient-facing medical QA. In benchmarking 8 LLMs, we observe a weak correlation between standard automated quality metrics and human evaluations, underscoring the need for enhanced sensitivity analysis in evaluating patient medical QA safety.
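The core comparison described above, checking how well automated quality metrics track human judgments of answer safety, can be sketched as a simple rank-correlation computation. The snippet below is an illustrative sketch only, not the authors' code: the score lists are hypothetical placeholders, whereas in the paper the scores come from the benchmarked LLMs' answers, the chosen automated metrics, and human annotators.

from scipy.stats import spearmanr

# Hypothetical scores for the same set of (paraphrased question, LLM answer) pairs.
automated_scores = [0.81, 0.75, 0.90, 0.62, 0.88, 0.70]   # e.g., an automated quality metric in [0, 1]
human_safety_ratings = [3, 4, 2, 4, 3, 5]                  # e.g., annotator safety ratings, 1 (unsafe) to 5 (safe)

# Spearman rank correlation between the automated metric and human safety ratings.
rho, p_value = spearmanr(automated_scores, human_safety_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A weak |rho| suggests the automated metric is a poor proxy for human safety judgments.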

Cite this Paper


BibTeX
@InProceedings{pmlr-v259-diekmann25a,
  title     = {Evaluating Safety of Large Language Models for Patient-facing Medical Question Answering},
  author    = {Diekmann, Yella and Fensore, Chase M and Carrillo-Larco, Rodrigo M and Pradhan, Nishant and Appana, Bhavya and Ho, Joyce C},
  booktitle = {Proceedings of the 4th Machine Learning for Health Symposium},
  pages     = {267--290},
  year      = {2025},
  editor    = {Hegselmann, Stefan and Zhou, Helen and Healey, Elizabeth and Chang, Trenton and Ellington, Caleb and Mhasawade, Vishwali and Tonekaboni, Sana and Argaw, Peniel and Zhang, Haoran},
  volume    = {259},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v259/main/assets/diekmann25a/diekmann25a.pdf},
  url       = {https://proceedings.mlr.press/v259/diekmann25a.html},
  abstract  = {Large language models (LLMs) have revolutionized the question answering (QA) domain by achieving near-human performance across a broad range of tasks. Recent studies have suggested LLMs are capable of answering clinical questions and providing medical advice. Although LLMs’ answers must be reliable and safe, existing evaluations of medical QA systems often only focus on the accuracy of the content. However, a critical, underexplored aspect is whether variations in patient inquiries - rephrasing the same question - lead to inconsistent or unsafe LLM responses. We propose a new evaluation methodology leveraging synthetic question generation to rigorously assess the safety of LLMs in patient-facing medical QA. In benchmarking 8 LLMs, we observe a weak correlation between standard automated quality metrics and human evaluations, underscoring the need for enhanced sensitivity analysis in evaluating patient medical QA safety.}
}
Endnote
%0 Conference Paper
%T Evaluating Safety of Large Language Models for Patient-facing Medical Question Answering
%A Yella Diekmann
%A Chase M Fensore
%A Rodrigo M Carrillo-Larco
%A Nishant Pradhan
%A Bhavya Appana
%A Joyce C Ho
%B Proceedings of the 4th Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2025
%E Stefan Hegselmann
%E Helen Zhou
%E Elizabeth Healey
%E Trenton Chang
%E Caleb Ellington
%E Vishwali Mhasawade
%E Sana Tonekaboni
%E Peniel Argaw
%E Haoran Zhang
%F pmlr-v259-diekmann25a
%I PMLR
%P 267--290
%U https://proceedings.mlr.press/v259/diekmann25a.html
%V 259
%X Large language models (LLMs) have revolutionized the question answering (QA) domain by achieving near-human performance across a broad range of tasks. Recent studies have suggested LLMs are capable of answering clinical questions and providing medical advice. Although LLMs’ answers must be reliable and safe, existing evaluations of medical QA systems often only focus on the accuracy of the content. However, a critical, underexplored aspect is whether variations in patient inquiries - rephrasing the same question - lead to inconsistent or unsafe LLM responses. We propose a new evaluation methodology leveraging synthetic question generation to rigorously assess the safety of LLMs in patient-facing medical QA. In benchmarking 8 LLMs, we observe a weak correlation between standard automated quality metrics and human evaluations, underscoring the need for enhanced sensitivity analysis in evaluating patient medical QA safety.
APA
Diekmann, Y., Fensore, C.M., Carrillo-Larco, R.M., Pradhan, N., Appana, B. & Ho, J.C. (2025). Evaluating Safety of Large Language Models for Patient-facing Medical Question Answering. Proceedings of the 4th Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 259:267-290. Available from https://proceedings.mlr.press/v259/diekmann25a.html.