CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

Xiao Yu Cindy Zhang, Carlos R. Ferreira, Francis Rossignol, Raymond T. Ng, Wyeth Wasserman, Jian Zhu
Proceedings of the sixth Conference on Health, Inference, and Learning, PMLR 287:527-542, 2025.

Abstract

Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable dense information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-crafted dataset for dense information extraction from case reports (focusing on IEMs). Using this dataset, we assess various models and prompting strategies, introducing the novel strategies of category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought offers little advantage over zero-shot prompting. Category-specific prompting improves alignment with the benchmark. Open-source Qwen2.5:7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management, while highlighting areas for improvement, such as LLMs' limited ability to recognize negative findings for differential diagnosis. This work advances LLM-driven clinical NLP, paving the way for scalable, privacy-conscious medical AI applications.

Cite this Paper


BibTeX
@InProceedings{pmlr-v287-zhang25b,
  title = {CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports},
  author = {Zhang, Xiao Yu Cindy and Ferreira, Carlos R. and Rossignol, Francis and Ng, Raymond T. and Wasserman, Wyeth and Zhu, Jian},
  booktitle = {Proceedings of the sixth Conference on Health, Inference, and Learning},
  pages = {527--542},
  year = {2025},
  editor = {Xu, Xuhai Orson and Choi, Edward and Singhal, Pankhuri and Gerych, Walter and Tang, Shengpu and Agrawal, Monica and Subbaswamy, Adarsh and Sizikova, Elena and Dunn, Jessilyn and Daneshjou, Roxana and Sarker, Tasmie and McDermott, Matthew and Chen, Irene},
  volume = {287},
  series = {Proceedings of Machine Learning Research},
  month = {25--27 Jun},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v287/main/assets/zhang25b/zhang25b.pdf},
  url = {https://proceedings.mlr.press/v287/zhang25b.html},
  abstract = {Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable dense information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-crafted dataset for dense information extraction of case reports (focusing on IEMs). Using this dataset, we assess various models and promptings, introducing novel strategies of category-specific prompting and \textbf{subheading-filtered data integration}. Zero-shot chain-of-thought offers little advantage over zero-shot prompting. Category-specific prompting improves alignment to benchmark. Open-source Qwen2.5:7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management, while highlighting areas for improvement, such as LLM’s limitation in recognizing negative findings for differential diagnosis. This work advances LLM-driven clinical NLP, paving the way for scalable, privacy-conscious medical AI applications.}
}
Endnote
%0 Conference Paper
%T CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports
%A Xiao Yu Cindy Zhang
%A Carlos R. Ferreira
%A Francis Rossignol
%A Raymond T. Ng
%A Wyeth Wasserman
%A Jian Zhu
%B Proceedings of the sixth Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Xuhai Orson Xu
%E Edward Choi
%E Pankhuri Singhal
%E Walter Gerych
%E Shengpu Tang
%E Monica Agrawal
%E Adarsh Subbaswamy
%E Elena Sizikova
%E Jessilyn Dunn
%E Roxana Daneshjou
%E Tasmie Sarker
%E Matthew McDermott
%E Irene Chen
%F pmlr-v287-zhang25b
%I PMLR
%P 527--542
%U https://proceedings.mlr.press/v287/zhang25b.html
%V 287
%X Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable dense information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-crafted dataset for dense information extraction of case reports (focusing on IEMs). Using this dataset, we assess various models and promptings, introducing novel strategies of category-specific prompting and \textbf{subheading-filtered data integration}. Zero-shot chain-of-thought offers little advantage over zero-shot prompting. Category-specific prompting improves alignment to benchmark. Open-source Qwen2.5:7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management, while highlighting areas for improvement, such as LLM’s limitation in recognizing negative findings for differential diagnosis. This work advances LLM-driven clinical NLP, paving the way for scalable, privacy-conscious medical AI applications.
APA
Zhang, X.Y.C., Ferreira, C.R., Rossignol, F., Ng, R.T., Wasserman, W. & Zhu, J. (2025). CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports. Proceedings of the sixth Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 287:527-542. Available from https://proceedings.mlr.press/v287/zhang25b.html.

Related Material