Evaluating Robustness of LLM-Based Ambient Scribes for SOAP Note Generation
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:448-467, 2026.
Abstract
Clinical documentation is a major driver of clinician workload and burnout, motivating the adoption of ambient AI scribes that transcribe clinician-patient conversations into clinical notes. Safe deployment requires both transcript-grounded fidelity and robustness to upstream Automatic Speech Recognition (ASR) noise, properties not captured by traditional ROUGE-like metrics. We propose a clinically grounded evaluation framework that decomposes notes into atomic, QNOTE-structured facts and applies a two-phase triangulated protocol: (1) align generated facts to clinician-authored gold notes to measure coverage, omission, contradiction, and candidate additions; (2) verify gold-absent generated facts against transcripts to distinguish valid elaborations from unsupported content. Across eight LLM-based note generators, we find that omissions are the primary source of contextual degradation (8.5%–24.0%), while contradictions remain relatively stable (6.2%–7.9%). A large majority of content initially flagged as “added” relative to gold is supported by the transcript (92%), highlighting the importance of transcript verification. Robustness analysis with controlled transcript-level perturbations shows that conversational redundancy often mitigates errors (38.6% recovery), whereas substitution errors (e.g., negation flips, medical homophones) are more likely to propagate when redundancy is absent. These results provide a structured approach for evaluating fidelity and robustness in clinical note generation and suggest practical considerations for safer deployment.
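
The two-phase protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Fact` fields, the function names, exact matching on `(section, attribute)`, and the substring check against the transcript are all assumptions made for illustration; a real system would presumably use the actual QNOTE schema and a semantic matcher or clinician judge in place of string comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    # Hypothetical atomic fact; the real QNOTE schema may differ.
    section: str    # SOAP section, e.g. "S", "O", "A", "P"
    attribute: str  # what the fact is about
    value: str      # what is asserted

    @property
    def key(self):
        return (self.section, self.attribute)


def align_to_gold(generated, gold):
    """Phase 1: align generated facts to gold facts by (section, attribute)."""
    gold_by_key = {f.key: f for f in gold}
    gen_by_key = {f.key: f for f in generated}
    result = {"covered": [], "contradicted": [], "omitted": [],
              "candidate_additions": []}
    for key, gold_fact in gold_by_key.items():
        gen_fact = gen_by_key.get(key)
        if gen_fact is None:
            result["omitted"].append(gold_fact)
        elif gen_fact.value.lower() == gold_fact.value.lower():
            result["covered"].append(gen_fact)
        else:
            # Assumption: any value mismatch counts as a contradiction.
            result["contradicted"].append(gen_fact)
    for key, gen_fact in gen_by_key.items():
        if key not in gold_by_key:
            result["candidate_additions"].append(gen_fact)
    return result


def verify_against_transcript(candidates, transcript):
    """Phase 2: split gold-absent facts into supported and unsupported.

    Assumption: a substring check stands in for real verification.
    """
    text = transcript.lower()
    supported = [f for f in candidates if f.value.lower() in text]
    unsupported = [f for f in candidates if f.value.lower() not in text]
    return supported, unsupported


# Toy data, invented for illustration only.
gold = [
    Fact("S", "chief complaint", "chest pain"),
    Fact("S", "fever", "denies"),
    Fact("P", "medication", "aspirin"),
]
generated = [
    Fact("S", "chief complaint", "chest pain"),
    Fact("S", "fever", "reports"),
    Fact("S", "duration", "two days"),
    Fact("P", "follow-up", "one week"),
]
transcript = "Patient: I've had chest pain for two days. No fever."

phase1 = align_to_gold(generated, gold)
supported, unsupported = verify_against_transcript(
    phase1["candidate_additions"], transcript
)
```

In this toy case the "duration" fact is absent from the gold note but supported by the transcript, while the "follow-up" fact is supported by neither. This is the distinction between valid elaborations and unsupported content that phase 2 is meant to draw.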