FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1297-1315, 2026.
Abstract
The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, requiring LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance.