Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source LLMs in Medical Reasoning

Halimat Afolabi, Zainab Afolabi, Elizabeth Friel, Jude Roberts, Antonio Ji-Xu, Lloyd Chen, Egheosa Ogbomo, Emiliomo Imevbore, Phil Eneje, Wissal El Ouahidi, Aaron Sohal, Alisa Kennan, Shreya Srivastava, Anirudh Vairavan, Laura Napitu, Katie McClure
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1562-1591, 2026.

Abstract

Closed-source large language models ({LLM}s), such as ChatGPT and Gemini, are increasingly consulted for medical advice, yet their explanations may appear plausible while failing to reflect the model’s underlying reasoning process. This gap poses serious risks as patients and clinicians may trust coherent but misleading explanations. We conduct a systematic black-box evaluation of faithfulness in medical reasoning among three widely used closed-source {LLM}s. Our study consists of three perturbation-based probes: (1) causal ablation, testing whether stated chain-of-thought ({CoT}) reasoning causally influences predictions; (2) positional bias, examining whether models create post-hoc justifications for answers driven by input positioning; and (3) hint injection, testing susceptibility to external suggestions. We complement these quantitative probes with a small-scale human evaluation of model responses to patient-style medical queries to examine concordance between physician assessments of explanation faithfulness and layperson perceptions of trustworthiness. We find that {CoT} reasoning steps often do not causally drive predictions, and models readily incorporate external hints without acknowledgment. In contrast, positional biases showed minimal impact in this setting. These results underscore that faithfulness, not just accuracy, must be central in evaluating {LLM}s for medicine, to ensure both public protection and safe clinical deployment.

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-afolabi26a, title = {Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source {LLM}s in Medical Reasoning}, author = {Afolabi, Halimat and Afolabi, Zainab and Friel, Elizabeth and Roberts, Jude and Ji-Xu, Antonio and Chen, Lloyd and Ogbomo, Egheosa and Imevbore, Emiliomo and Eneje, Phil and El Ouahidi, Wissal and Sohal, Aaron and Kennan, Alisa and Srivastava, Shreya and Vairavan, Anirudh and Napitu, Laura and McClure, Katie}, booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium}, pages = {1562--1591}, year = {2026}, editor = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush}, volume = {297}, series = {Proceedings of Machine Learning Research}, month = {13--14 Dec}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/afolabi26a/afolabi26a.pdf}, url = {https://proceedings.mlr.press/v297/afolabi26a.html}, abstract = {Closed-source large language models ({LLM}s), such as ChatGPT and Gemini, are increasingly consulted for medical advice, yet their explanations may appear plausible while failing to reflect the model’s underlying reasoning process. This gap poses serious risks as patients and clinicians may trust coherent but misleading explanations. We conduct a systematic black-box evaluation of faithfulness in medical reasoning among three widely used closed-source {LLM}s. Our study consists of three perturbation-based probes: (1) causal ablation, testing whether stated chain-of-thought ({CoT}) reasoning causally influences predictions; (2) positional bias, examining whether models create post-hoc justifications for answers driven by input positioning; and (3) hint injection, testing susceptibility to external suggestions. We complement these quantitative probes with a small-scale human evaluation of model responses to patient-style medical queries to examine concordance between physician assessments of explanation faithfulness and layperson perceptions of trustworthiness. We find that {CoT} reasoning steps often do not causally drive predictions, and models readily incorporate external hints without acknowledgment. In contrast, positional biases showed minimal impact in this setting. These results underscore that faithfulness, not just accuracy, must be central in evaluating {LLM}s for medicine, to ensure both public protection and safe clinical deployment.} }
Endnote
%0 Conference Paper %T Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source LLMs in Medical Reasoning %A Halimat Afolabi %A Zainab Afolabi %A Elizabeth Friel %A Jude Roberts %A Antonio Ji-Xu %A Lloyd Chen %A Egheosa Ogbomo %A Emiliomo Imevbore %A Phil Eneje %A Wissal El Ouahidi %A Aaron Sohal %A Alisa Kennan %A Shreya Srivastava %A Anirudh Vairavan %A Laura Napitu %A Katie McClure %B Proceedings of the Fifth Machine Learning for Health Symposium %C Proceedings of Machine Learning Research %D 2026 %E Peniel Argaw %E Haoran Zhang %E Sarah Jabbour %E Payal Chandak %E Jerry Ji %E Sumit Mukherjee %E Olawale Salaudeen %E Trenton Chang %E Elizabeth Healey %E Fabian Gröger %E Amin Adibi %E Stefan Hegselmann %E Benjamin Wild %E Ayush Noori %F pmlr-v297-afolabi26a %I PMLR %P 1562--1591 %U https://proceedings.mlr.press/v297/afolabi26a.html %V 297 %X Closed-source large language models ({LLM}s), such as ChatGPT and Gemini, are increasingly consulted for medical advice, yet their explanations may appear plausible while failing to reflect the model’s underlying reasoning process. This gap poses serious risks as patients and clinicians may trust coherent but misleading explanations. We conduct a systematic black-box evaluation of faithfulness in medical reasoning among three widely used closed-source {LLM}s. Our study consists of three perturbation-based probes: (1) causal ablation, testing whether stated chain-of-thought ({CoT}) reasoning causally influences predictions; (2) positional bias, examining whether models create post-hoc justifications for answers driven by input positioning; and (3) hint injection, testing susceptibility to external suggestions. We complement these quantitative probes with a small-scale human evaluation of model responses to patient-style medical queries to examine concordance between physician assessments of explanation faithfulness and layperson perceptions of trustworthiness. We find that {CoT} reasoning steps often do not causally drive predictions, and models readily incorporate external hints without acknowledgment. In contrast, positional biases showed minimal impact in this setting. These results underscore that faithfulness, not just accuracy, must be central in evaluating {LLM}s for medicine, to ensure both public protection and safe clinical deployment.
APA
Afolabi, H., Afolabi, Z., Friel, E., Roberts, J., Ji-Xu, A., Chen, L., Ogbomo, E., Imevbore, E., Eneje, P., El Ouahidi, W., Sohal, A., Kennan, A., Srivastava, S., Vairavan, A., Napitu, L. & McClure, K.. (2026). Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source LLMs in Medical Reasoning. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1562-1591 Available from https://proceedings.mlr.press/v297/afolabi26a.html.

Related Material