Evaluating ChatGPT’s Performance in Generating and Assessing Dutch Radiology Report Impressions
Proceedings of The 7th International Conference on Medical Imaging with Deep Learning, PMLR 250:168-183, 2024.
Abstract
The integration of Large Language Models (LLMs), such as ChatGPT, in radiology could offer insight and interpretation for the increasing number of radiological findings generated by Artificial Intelligence (AI). However, the complexity of medical text presents many challenges for LLMs, particularly in less common languages such as Dutch. This study therefore aims to evaluate ChatGPT’s ability to generate accurate ‘Impression’ sections of radiology reports, and its effectiveness in evaluating these sections compared against human radiologist judgments. We utilized a dataset of CT-thorax radiology reports to fine-tune ChatGPT and then conducted a reader study with two radiologists and GPT-4 out-of-the-box to evaluate the AI-generated ‘Impression’ sections in comparison to the originals. The results revealed that human experts rated original impressions higher than AI-generated ones across correctness, completeness, and conciseness, highlighting a gap in the AI’s ability to generate clinically reliable medical text. Additionally, GPT-4’s evaluations were more favorable towards AI-generated content, indicating limitations in its out-of-the-box use as an evaluator in specialized domains. The study emphasizes the need for cautious integration of LLMs into medical domains and the importance of expert validation, yet also acknowledges the inherent subjectivity in interpreting and evaluating medical reports.
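The abstract describes a two-stage pipeline: fine-tuning a ChatGPT-class model on findings-to-impression pairs, then using out-of-the-box GPT-4 as an automatic rater. The sketch below illustrates how such a pipeline could be wired up with the OpenAI Python SDK; the model names, prompt wording, file name, and the 1–5 rating scale are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch of the two stages implied by the abstract:
# (1) fine-tune a ChatGPT-class model on findings -> impression pairs,
# (2) use out-of-the-box GPT-4 to rate an impression on one criterion.
# Prompts, model names, and the rating scale are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def write_training_file(pairs, path="impressions_train.jsonl"):
    """Serialize (findings, impression) pairs into OpenAI's chat JSONL format."""
    with open(path, "w", encoding="utf-8") as f:
        for findings, impression in pairs:
            example = {"messages": [
                {"role": "system", "content": "You are a radiologist. Write the "
                 "'Impression' section for this Dutch CT-thorax report, in Dutch."},
                {"role": "user", "content": findings},
                {"role": "assistant", "content": impression},
            ]}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def start_finetune(path="impressions_train.jsonl"):
    """Upload the training file and launch a fine-tuning job."""
    upload = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    return client.fine_tuning.jobs.create(training_file=upload.id,
                                          model="gpt-3.5-turbo")


def gpt4_rate(findings, impression, criterion):
    """Ask out-of-the-box GPT-4 to score an impression on one criterion (1-5)."""
    prompt = (f"Rate the following radiology 'Impression' for {criterion} "
              f"on a scale of 1 (poor) to 5 (excellent). "
              f"Reply with the number only.\n\n"
              f"Findings:\n{findings}\n\nImpression:\n{impression}")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```

In a reader study like the one described, the same criteria (correctness, completeness, conciseness) would be scored independently by the radiologists and by `gpt4_rate`, so that the human and GPT-4 ratings of original versus AI-generated impressions can be compared directly.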