RadRevise: A Benchmark Dataset for Instruction-Based Radiology Report Editing
Proceedings of The First AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 281:183-194, 2025.
Abstract
Large Language Models (LLMs) can assist radiologists by making precise edits to radiology reports based on human instructions. However, evaluating the quality of such modifications has been challenging due to the lack of publicly available datasets. To address this gap, we present RadRevise, a novel dataset for assessing models’ ability to modify radiology reports according to specific instructions. RadRevise is derived from the radiology reports in the MIMIC-CXR dataset and includes 6,402 instructions and 2,922 modified reports. Each report is paired with one to five modification instructions and the corresponding modified output, covering various clinical topics and instruction types. Our benchmarking of current open-source models reveals performance gaps in accurately executing these instructions, highlighting areas for improvement in AI-assisted report modification.
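
To make the task concrete, the sketch below shows what a single example might look like under the structure the abstract describes: an original MIMIC-CXR report, one to five edit instructions, and the reference modified report. This is a minimal, hypothetical illustration; the field names and report text are assumptions for exposition, not the dataset's published schema.

    # Hypothetical sketch of one instruction-based editing example.
    # Field names and report text are illustrative assumptions only,
    # not the actual RadRevise schema.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ReportEditExample:
        original_report: str        # source report text (derived from MIMIC-CXR)
        instructions: List[str]     # one to five natural-language edit instructions
        modified_report: str        # reference report after applying all instructions

    example = ReportEditExample(
        original_report="No focal consolidation. Mild cardiomegaly is noted.",
        instructions=[
            "Remove the mention of cardiomegaly.",
            "Add that a small left pleural effusion is present.",
        ],
        modified_report="No focal consolidation. Small left pleural effusion is present.",
    )

    # A model under evaluation would receive the original report and the
    # instructions, and its output would be compared against modified_report.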