LUNGUAGE: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon; Geon Choi; Paloma Rabaey; Min Gwan Kim; Jung-Oh Lee; Hyuk Gi Hong; Eun Woo Doe; Hangyul Yoon; Jiyoun Kim; Harshita Sharma; Daniel C Castro; Javier Alvarez-Valle; Edward Choi

Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:148-197, 2026.

Abstract

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation.

Cite this Paper

BibTeX

@InProceedings{pmlr-v333-moon26a,
  title = 	 {LUNGUAGE: A Benchmark for Structured and Sequential Chest X-ray Interpretation},
  author =       {Moon, Jong Hak and Choi, Geon and Rabaey, Paloma and Kim, Min Gwan and Lee, Jung-Oh and Hong, Hyuk Gi and Doe, Eun Woo and Yoon, Hangyul and Kim, Jiyoun and Sharma, Harshita and Castro, Daniel C and Alvarez-Valle, Javier and Choi, Edward},
  booktitle = 	 {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = 	 {148--197},
  year = 	 {2026},
  editor = 	 {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = 	 {333},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29--30 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v333/main/assets/moon26a/moon26a.pdf},
  url = 	 {https://proceedings.mlr.press/v333/moon26a.html},
  abstract = 	 {Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation.}
}

Endnote

%0 Conference Paper
%T LUNGUAGE: A Benchmark for Structured and Sequential Chest X-ray Interpretation
%A Jong Hak Moon
%A Geon Choi
%A Paloma Rabaey
%A Min Gwan Kim
%A Jung-Oh Lee
%A Hyuk Gi Hong
%A Eun Woo Doe
%A Hangyul Yoon
%A Jiyoun Kim
%A Harshita Sharma
%A Daniel C Castro
%A Javier Alvarez-Valle
%A Edward Choi
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-moon26a
%I PMLR
%P 148--197
%U https://proceedings.mlr.press/v333/moon26a.html
%V 333
%X Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation.

APA

Moon, J.H., Choi, G., Rabaey, P., Kim, M.G., Lee, J., Hong, H.G., Doe, E.W., Yoon, H., Kim, J., Sharma, H., Castro, D.C., Alvarez-Valle, J. & Choi, E. (2026). LUNGUAGE: A Benchmark for Structured and Sequential Chest X-ray Interpretation. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:148-197. Available from https://proceedings.mlr.press/v333/moon26a.html.
