RadRevise: A Benchmark Dataset for Instruction-Based Radiology Report Editing
Proceedings of The First AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 281:183-194, 2025.
Abstract
Large Language Models (LLMs) can assist radiologists by making precise edits to radiology reports based on human instructions. However, evaluating the quality of such modifications has been challenging due to the lack of publicly available datasets. To address this gap, we present RadRevise, a novel dataset for assessing models’ ability to modify radiology reports according to specific instructions. RadRevise is derived from the radiology reports in the MIMIC-CXR dataset and includes 6,402 instructions and 2,922 modified reports. Each report is paired with one to five modification instructions and the corresponding modified output, covering various clinical topics and instruction types. Our benchmarking of current open-source models reveals performance gaps in accurately executing these instructions, highlighting areas for improvement in AI-assisted report modification.
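
To make the task concrete, the sketch below shows what a single example might look like under the structure the abstract describes: an original MIMIC-CXR report, one to five edit instructions, and the reference modified report. This is a minimal, hypothetical illustration; the field names and report text are assumptions for exposition, not the dataset's published schema.

    # Hypothetical sketch of one instruction-based editing example.
    # Field names and report text are illustrative assumptions only,
    # not the actual RadRevise schema.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ReportEditExample:
        original_report: str        # source report text (derived from MIMIC-CXR)
        instructions: List[str]     # one to five natural-language edit instructions
        modified_report: str        # reference report after applying all instructions

    example = ReportEditExample(
        original_report="No focal consolidation. Mild cardiomegaly is noted.",
        instructions=[
            "Remove the mention of cardiomegaly.",
            "Add that a small left pleural effusion is present.",
        ],
        modified_report="No focal consolidation. Small left pleural effusion is present.",
    )

    # A model under evaluation would receive the original report and the
    # instructions, and its output would be compared against modified_report.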