FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Yugang Jia, Jong Ha Lee
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1297-1315, 2026.

Abstract

The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, requiring LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance.
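To make the retrieval-and-reasoning challenge concrete, the following is a minimal illustrative sketch (not code from the paper) of the kind of work an agent must do once a FHIR search returns: walking a nested searchset Bundle to answer a simple clinical question. The field names follow the FHIR R4 Observation resource; the sample payload, the `latest_value` helper, and the LOINC code used are fabricated for illustration.

```python
# Illustrative sketch: answering "what is the patient's most recent heart
# rate?" from a FHIR searchset Bundle. Field names follow FHIR R4
# Observation; the sample data below is fabricated.

sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {
            "resourceType": "Observation",
            "code": {"coding": [{"system": "http://loinc.org",
                                 "code": "8867-4", "display": "Heart rate"}]},
            "valueQuantity": {"value": 88, "unit": "beats/minute"},
            "effectiveDateTime": "2026-01-05T09:30:00Z"}},
        {"resource": {
            "resourceType": "Observation",
            "code": {"coding": [{"system": "http://loinc.org",
                                 "code": "8867-4", "display": "Heart rate"}]},
            "valueQuantity": {"value": 72, "unit": "beats/minute"},
            "effectiveDateTime": "2026-01-06T09:30:00Z"}},
    ],
}

def latest_value(bundle: dict, loinc_code: str):
    """Return (value, unit) of the most recent Observation matching a LOINC code."""
    matches = [
        e["resource"] for e in bundle.get("entry", [])
        if e["resource"]["resourceType"] == "Observation"
        and any(c.get("code") == loinc_code
                for c in e["resource"]["code"]["coding"])
    ]
    if not matches:
        return None
    # FHIR dateTime strings in the same timezone sort lexicographically.
    latest = max(matches, key=lambda r: r["effectiveDateTime"])
    q = latest["valueQuantity"]
    return q["value"], q["unit"]

print(latest_value(sample_bundle, "8867-4"))  # → (72, 'beats/minute')
```

Even this toy case shows why the benchmark's comparison matters: the answer is buried several levels deep in nested resources, so an agent must either generate code like the above or reason over the raw JSON in natural language.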

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-lee26a,
  title     = {{FHIR-AgentBench}: Benchmarking {LLM} Agents for Realistic Interoperable {EHR} Question Answering},
  author    = {Lee, Gyubok and Bach, Elea and Yang, Eric and Pollard, Tom and Johnson, Alistair and Choi, Edward and Jia, Yugang and Lee, Jong Ha},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {1297--1315},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/lee26a/lee26a.pdf},
  url       = {https://proceedings.mlr.press/v297/lee26a.html},
  abstract  = {The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources ({HL7 FHIR}) standard opens a new frontier for clinical {AI}, demanding {LLM} agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent {LLM}s on interoperable clinical data. To bridge this gap, we introduce {FHIR-AgentBench}—a benchmark that grounds 2,931 real-world clinical questions in the {HL7 FHIR} standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct {FHIR} {API} calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate {FHIR} resources and the difficulty of reasoning over them—both of which critically affect question answering performance.}
}
Endnote
%0 Conference Paper
%T FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
%A Gyubok Lee
%A Elea Bach
%A Eric Yang
%A Tom Pollard
%A Alistair Johnson
%A Edward Choi
%A Yugang Jia
%A Jong Ha Lee
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-lee26a
%I PMLR
%P 1297--1315
%U https://proceedings.mlr.press/v297/lee26a.html
%V 297
%X The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance.
APA
Lee, G., Bach, E., Yang, E., Pollard, T., Johnson, A., Choi, E., Jia, Y. & Lee, J. H. (2026). FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1297-1315. Available from https://proceedings.mlr.press/v297/lee26a.html.