Evaluating Robustness of LLM-Based Ambient Scribes for SOAP Note Generation

Ehsan Latif, Aleema Faisal, Shaheer Hammad, Dayyan Ali Akhtar, Agha Ali Raza, Ihsan Ayyub Qazi
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:448-467, 2026.

Abstract

Clinical documentation is a major driver of clinician workload and burnout, motivating the adoption of ambient AI scribes that transcribe clinician-patient conversations into clinical notes. Safe deployment requires both transcript-grounded fidelity and robustness to upstream Automatic Speech Recognition (ASR) noise, properties not captured by traditional ROUGE-like metrics. We propose a clinically grounded evaluation framework that decomposes notes into atomic, QNOTE-structured facts and applies a two-phase triangulated protocol: (1) align generated facts to clinician-authored gold notes to measure coverage, omission, contradiction, and candidate additions; (2) verify gold-absent generated facts against transcripts to distinguish valid elaborations from unsupported content. Across eight LLM-based note generators, we find that omissions are the primary source of contextual degradation (8.5%–24.0%), while contradictions remain relatively stable (6.2%–7.9%). A large majority of content initially flagged as “added” relative to gold is supported by the transcript (92%), highlighting the importance of transcript verification. Robustness analysis with controlled transcript-level perturbations shows that conversational redundancy often mitigates errors (38.6% recovery), whereas substitution errors (e.g., negation flips, medical homophones) are more likely to propagate when redundancy is absent. These results provide a structured approach for evaluating fidelity and robustness in clinical note generation and suggest practical considerations for safer deployment.
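
The two-phase protocol described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Fact` class, the key/value representation of a fact, the exact-match alignment, and the function names `phase_one` and `phase_two` are all assumptions made for illustration. A real system would likely need semantic matching rather than string equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str    # hypothetical slot name, e.g. "chest_pain"
    value: str  # hypothetical normalized value, e.g. "denied"


def phase_one(gold, generated):
    """Phase 1 (toy version): align generated facts to gold facts by key."""
    gold_by_key = {f.key: f.value for f in gold}
    gen_by_key = {f.key: f.value for f in generated}
    # Gold facts reproduced with the same value.
    covered = {k for k, v in gold_by_key.items() if gen_by_key.get(k) == v}
    # Gold facts present in the generated note but with a different value.
    contradicted = {
        k for k, v in gold_by_key.items()
        if k in gen_by_key and gen_by_key[k] != v
    }
    # Gold facts missing from the generated note.
    omitted = set(gold_by_key) - set(gen_by_key)
    # Generated facts absent from gold: candidates only, not yet errors.
    candidate_additions = set(gen_by_key) - set(gold_by_key)
    return covered, contradicted, omitted, candidate_additions


def phase_two(candidate_additions, generated, transcript_facts):
    """Phase 2 (toy version): check gold-absent facts against the transcript."""
    gen_by_key = {f.key: f.value for f in generated}
    transcript = {(f.key, f.value) for f in transcript_facts}
    supported = {
        k for k in candidate_additions if (k, gen_by_key[k]) in transcript
    }
    unsupported = candidate_additions - supported
    return supported, unsupported


# Made-up example data, not taken from the paper.
gold = [Fact("chest_pain", "denied"), Fact("fever", "present"),
        Fact("cough", "3 days")]
generated = [Fact("chest_pain", "present"), Fact("fever", "present"),
             Fact("smoker", "no"), Fact("allergy", "penicillin")]
transcript_facts = [Fact("chest_pain", "denied"), Fact("fever", "present"),
                    Fact("cough", "3 days"), Fact("smoker", "no")]

covered, contradicted, omitted, additions = phase_one(gold, generated)
supported, unsupported = phase_two(additions, generated, transcript_facts)
```

In this toy run, the "smoker" fact is absent from gold but supported by the transcript, so it counts as a valid elaboration, while the "allergy" fact is unsupported. Separating these two cases is the point of the second phase.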

Cite this Paper


BibTeX
@InProceedings{pmlr-v333-latif26a,
  title = {Evaluating Robustness of LLM-Based Ambient Scribes for SOAP Note Generation},
  author = {Latif, Ehsan and Faisal, Aleema and Hammad, Shaheer and Akhtar, Dayyan Ali and Raza, Agha Ali and Qazi, Ihsan Ayyub},
  booktitle = {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = {448--467},
  year = {2026},
  editor = {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = {333},
  series = {Proceedings of Machine Learning Research},
  month = {29--30 Jun},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v333/main/assets/latif26a/latif26a.pdf},
  url = {https://proceedings.mlr.press/v333/latif26a.html},
  abstract = {Clinical documentation is a major driver of clinician workload and burnout, motivating the adoption of ambient AI scribes that transcribe clinician-patient conversations into clinical notes. Safe deployment requires both transcript-grounded fidelity and robustness to upstream Automatic Speech Recognition (ASR) noise, properties not captured by traditional ROUGE-like metrics. We propose a clinically grounded evaluation framework that decomposes notes into atomic, QNOTE-structured facts and applies a two-phase triangulated protocol: (1) align generated facts to clinician-authored gold notes to measure coverage, omission, contradiction, and candidate additions; (2) verify gold-absent generated facts against transcripts to distinguish valid elaborations from unsupported content. Across eight LLM-based note generators, we find that omissions are the primary source of contextual degradation (8.5%–24.0%), while contradictions remain relatively stable (6.2%–7.9%). A large majority of content initially flagged as “added” relative to gold is supported by the transcript (92%), highlighting the importance of transcript verification. Robustness analysis with controlled transcript-level perturbations shows that conversational redundancy often mitigates errors (38.6% recovery), whereas substitution errors (e.g., negation flips, medical homophones) are more likely to propagate when redundancy is absent. These results provide a structured approach for evaluating fidelity and robustness in clinical note generation and suggest practical considerations for safer deployment.}
}
Endnote
%0 Conference Paper
%T Evaluating Robustness of LLM-Based Ambient Scribes for SOAP Note Generation
%A Ehsan Latif
%A Aleema Faisal
%A Shaheer Hammad
%A Dayyan Ali Akhtar
%A Agha Ali Raza
%A Ihsan Ayyub Qazi
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-latif26a
%I PMLR
%P 448--467
%U https://proceedings.mlr.press/v333/latif26a.html
%V 333
%X Clinical documentation is a major driver of clinician workload and burnout, motivating the adoption of ambient AI scribes that transcribe clinician-patient conversations into clinical notes. Safe deployment requires both transcript-grounded fidelity and robustness to upstream Automatic Speech Recognition (ASR) noise, properties not captured by traditional ROUGE-like metrics. We propose a clinically grounded evaluation framework that decomposes notes into atomic, QNOTE-structured facts and applies a two-phase triangulated protocol: (1) align generated facts to clinician-authored gold notes to measure coverage, omission, contradiction, and candidate additions; (2) verify gold-absent generated facts against transcripts to distinguish valid elaborations from unsupported content. Across eight LLM-based note generators, we find that omissions are the primary source of contextual degradation (8.5%–24.0%), while contradictions remain relatively stable (6.2%–7.9%). A large majority of content initially flagged as “added” relative to gold is supported by the transcript (92%), highlighting the importance of transcript verification. Robustness analysis with controlled transcript-level perturbations shows that conversational redundancy often mitigates errors (38.6% recovery), whereas substitution errors (e.g., negation flips, medical homophones) are more likely to propagate when redundancy is absent. These results provide a structured approach for evaluating fidelity and robustness in clinical note generation and suggest practical considerations for safer deployment.
APA
Latif, E., Faisal, A., Hammad, S., Akhtar, D.A., Raza, A.A. & Qazi, I.A. (2026). Evaluating Robustness of LLM-Based Ambient Scribes for SOAP Note Generation. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:448-467. Available from https://proceedings.mlr.press/v333/latif26a.html.
