FineRadScore: A Radiology Report Line-by-Line Evaluation Technique Generating Corrections with Severity Scores

Alyssa Huang; Oishi Banerjee; Kay Wu; Eduardo Pontes Reis; Pranav Rajpurkar

Proceedings of the 9th Machine Learning for Healthcare Conference, PMLR 252, 2024.

Abstract

The current gold standard for evaluating generated chest x-ray (CXR) reports is through radiologist annotations. However, this process can be extremely time-consuming and costly, especially when evaluating large numbers of reports. In this work, we present FineRadScore, a Large Language Model (LLM)-based automated evaluation metric for generated CXR reports. Given a candidate report and a ground-truth report, FineRadScore gives the minimum number of line-by-line corrections required to go from the candidate to the ground-truth report. Additionally, FineRadScore provides an error severity rating with each correction and generates comments explaining why the correction was needed. We demonstrate that FineRadScore’s corrections and error severity scores align with radiologist opinions. We also show that, when used to judge the quality of the report as a whole, FineRadScore aligns with radiologists as well as current state-of-the-art automated CXR evaluation metrics. Finally, we analyze FineRadScore’s shortcomings to provide suggestions for future improvements.
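The output described in the abstract (one correction per line, each with a severity rating and an explanatory comment) can be pictured as a small record type. The sketch below is illustrative only and is not the authors' implementation: the field names, the integer severity scale, the `summarize` helper, and the example report lines are all assumptions, and the LLM step that actually produces the corrections is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineCorrection:
    """One line-level correction (all field names are hypothetical)."""
    line_index: int   # position of the line in the candidate report
    original: str     # candidate line as written
    corrected: str    # line after correction toward the ground-truth report
    severity: int     # assumed integer scale; higher means more severe
    comment: str      # explanation of why the correction was needed


def summarize(corrections: List[LineCorrection]) -> dict:
    """Collapse line-level corrections into report-level numbers.

    The choice of count and maximum severity is an assumption made for
    this sketch, not the aggregation used in the paper.
    """
    return {
        "num_corrections": len(corrections),
        "max_severity": max((c.severity for c in corrections), default=0),
    }


# Made-up example data, for illustration only.
example = [
    LineCorrection(0, "No pleural effusion.", "Small left pleural effusion.", 3,
                   "Candidate misses a finding present in the ground truth."),
    LineCorrection(2, "Heart size is normal.", "Heart size is mildly enlarged.", 2,
                   "Candidate understates cardiac size."),
]
print(summarize(example))
```

Keeping the per-line records separate from the report-level summary mirrors the two uses mentioned in the abstract: inspecting individual corrections and judging the report as a whole.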

Cite this Paper

BibTeX

@InProceedings{pmlr-v252-huang24a,
  title = 	 {FineRadScore: A Radiology Report Line-by-Line Evaluation Technique Generating Corrections with Severity Scores},
  author =       {Huang, Alyssa and Banerjee, Oishi and Wu, Kay and Reis, Eduardo Pontes and Rajpurkar, Pranav},
  booktitle = 	 {Proceedings of the 9th Machine Learning for Healthcare Conference},
  year = 	 {2024},
  editor = 	 {Deshpande, Kaivalya and Fiterau, Madalina and Joshi, Shalmali and Lipton, Zachary and Ranganath, Rajesh and Urteaga, Iñigo},
  volume = 	 {252},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {16--17 Aug},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v252/main/assets/huang24a/huang24a.pdf},
  url = 	 {https://proceedings.mlr.press/v252/huang24a.html},
  abstract = 	 {The current gold standard for evaluating generated chest x-ray (CXR) reports is through radiologist annotations. However, this process can be extremely time-consuming and costly, especially when evaluating large numbers of reports. In this work, we present FineRadScore, a Large Language Model (LLM)-based automated evaluation metric for generated CXR reports. Given a candidate report and a ground-truth report, FineRadScore gives the minimum number of line-by-line corrections required to go from the candidate to the ground-truth report. Additionally, FineRadScore provides an error severity rating with each correction and generates comments explaining why the correction was needed. We demonstrate that FineRadScore’s corrections and error severity scores align with radiologist opinions. We also show that, when used to judge the quality of the report as a whole, FineRadScore aligns with radiologists as well as current state-of-the-art automated CXR evaluation metrics. Finally, we analyze FineRadScore’s shortcomings to provide suggestions for future improvements.}
}

Endnote

%0 Conference Paper
%T FineRadScore: A Radiology Report Line-by-Line Evaluation Technique Generating Corrections with Severity Scores
%A Alyssa Huang
%A Oishi Banerjee
%A Kay Wu
%A Eduardo Pontes Reis
%A Pranav Rajpurkar
%B Proceedings of the 9th Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2024
%E Kaivalya Deshpande
%E Madalina Fiterau
%E Shalmali Joshi
%E Zachary Lipton
%E Rajesh Ranganath
%E Iñigo Urteaga
%F pmlr-v252-huang24a
%I PMLR
%U https://proceedings.mlr.press/v252/huang24a.html
%V 252
%X The current gold standard for evaluating generated chest x-ray (CXR) reports is through radiologist annotations. However, this process can be extremely time-consuming and costly, especially when evaluating large numbers of reports. In this work, we present FineRadScore, a Large Language Model (LLM)-based automated evaluation metric for generated CXR reports. Given a candidate report and a ground-truth report, FineRadScore gives the minimum number of line-by-line corrections required to go from the candidate to the ground-truth report. Additionally, FineRadScore provides an error severity rating with each correction and generates comments explaining why the correction was needed. We demonstrate that FineRadScore’s corrections and error severity scores align with radiologist opinions. We also show that, when used to judge the quality of the report as a whole, FineRadScore aligns with radiologists as well as current state-of-the-art automated CXR evaluation metrics. Finally, we analyze FineRadScore’s shortcomings to provide suggestions for future improvements.

APA

Huang, A., Banerjee, O., Wu, K., Reis, E.P. & Rajpurkar, P. (2024). FineRadScore: A Radiology Report Line-by-Line Evaluation Technique Generating Corrections with Severity Scores. Proceedings of the 9th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 252. Available from https://proceedings.mlr.press/v252/huang24a.html.
