MED-OMIT: Extrinsically-Focused Evaluation Metric for Omissions in Medical Summarization

Elliot Schumacher, Daniel Rosenthal, Dhruv Naik, Varun Nair, Luladay Price, Geoffrey Tso, Anitha Kannan
Proceedings of the 4th Machine Learning for Health Symposium, PMLR 259:897-922, 2025.

Abstract

Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify performance has lagged. An example of this challenge is in evaluating a summary of the patient’s medical record. A resulting summary can enable the provider to quickly get a high-level overview of the patient’s health status. Yet, a summary that omits important facts about the patient’s record can produce a misleading picture, which can lead to negative consequences on medical decision-making. We propose MED-OMIT as a metric to explore this challenge. We focus on using provider-patient history conversations to generate a subjective (a summary of the patient’s history) as a case study. We begin by discretizing facts from the dialogue and identifying which are omitted from the subjective. To determine which facts are clinically relevant, we measure the importance of each fact to a simulated differential diagnosis. We compare MED-OMIT’s performance to that of clinical experts and find broad agreement. We use MED-OMIT to evaluate LLM performance on subjective generation and find some LLMs (gpt-4 and llama-3.1-405b) work well with little effort, while others (e.g. Llama 2) perform worse.

Cite this Paper


BibTeX
@InProceedings{pmlr-v259-schumacher25a,
  title = {MED-OMIT: Extrinsically-Focused Evaluation Metric for Omissions in Medical Summarization},
  author = {Schumacher, Elliot and Rosenthal, Daniel and Naik, Dhruv and Nair, Varun and Price, Luladay and Tso, Geoffrey and Kannan, Anitha},
  booktitle = {Proceedings of the 4th Machine Learning for Health Symposium},
  pages = {897--922},
  year = {2025},
  editor = {Hegselmann, Stefan and Zhou, Helen and Healey, Elizabeth and Chang, Trenton and Ellington, Caleb and Mhasawade, Vishwali and Tonekaboni, Sana and Argaw, Peniel and Zhang, Haoran},
  volume = {259},
  series = {Proceedings of Machine Learning Research},
  month = {15--16 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v259/main/assets/schumacher25a/schumacher25a.pdf},
  url = {https://proceedings.mlr.press/v259/schumacher25a.html},
  abstract = {Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify performance has lagged. An example of this challenge is in evaluating a summary of the patient’s medical record. A resulting summary can enable the provider to quickly get a high-level overview of the patient’s health status. Yet, a summary that omits important facts about the patient’s record can produce a misleading picture, which can lead to negative consequences on medical decision-making. We propose MED-OMIT as a metric to explore this challenge. We focus on using provider-patient history conversations to generate a subjective (a summary of the patient’s history) as a case study. We begin by discretizing facts from the dialogue and identifying which are omitted from the subjective. To determine which facts are clinically relevant, we measure the importance of each fact to a simulated differential diagnosis. We compare MED-OMIT’s performance to that of clinical experts and find broad agreement. We use MED-OMIT to evaluate LLM performance on subjective generation and find some LLMs (gpt-4 and llama-3.1-405b) work well with little effort, while others (e.g. Llama 2) perform worse.}
}
Endnote
%0 Conference Paper
%T MED-OMIT: Extrinsically-Focused Evaluation Metric for Omissions in Medical Summarization
%A Elliot Schumacher
%A Daniel Rosenthal
%A Dhruv Naik
%A Varun Nair
%A Luladay Price
%A Geoffrey Tso
%A Anitha Kannan
%B Proceedings of the 4th Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2025
%E Stefan Hegselmann
%E Helen Zhou
%E Elizabeth Healey
%E Trenton Chang
%E Caleb Ellington
%E Vishwali Mhasawade
%E Sana Tonekaboni
%E Peniel Argaw
%E Haoran Zhang
%F pmlr-v259-schumacher25a
%I PMLR
%P 897--922
%U https://proceedings.mlr.press/v259/schumacher25a.html
%V 259
%X Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify performance has lagged. An example of this challenge is in evaluating a summary of the patient’s medical record. A resulting summary can enable the provider to quickly get a high-level overview of the patient’s health status. Yet, a summary that omits important facts about the patient’s record can produce a misleading picture, which can lead to negative consequences on medical decision-making. We propose MED-OMIT as a metric to explore this challenge. We focus on using provider-patient history conversations to generate a subjective (a summary of the patient’s history) as a case study. We begin by discretizing facts from the dialogue and identifying which are omitted from the subjective. To determine which facts are clinically relevant, we measure the importance of each fact to a simulated differential diagnosis. We compare MED-OMIT’s performance to that of clinical experts and find broad agreement. We use MED-OMIT to evaluate LLM performance on subjective generation and find some LLMs (gpt-4 and llama-3.1-405b) work well with little effort, while others (e.g. Llama 2) perform worse.
APA
Schumacher, E., Rosenthal, D., Naik, D., Nair, V., Price, L., Tso, G. &amp; Kannan, A. (2025). MED-OMIT: Extrinsically-Focused Evaluation Metric for Omissions in Medical Summarization. Proceedings of the 4th Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 259:897-922. Available from https://proceedings.mlr.press/v259/schumacher25a.html.