Evaluating ChatGPT’s Performance in Generating and Assessing Dutch Radiology Report Impressions
Proceedings of The 7th International Conference on Medical Imaging with Deep Learning, PMLR 250:168-183, 2024.
Abstract
The integration of Large Language Models (LLMs), such as ChatGPT, in radiology could offer insight and interpretation for the increasing number of radiological findings generated by Artificial Intelligence (AI). However, the complexity of medical text presents many challenges for LLMs, particularly in less common languages such as Dutch. This study therefore aims to evaluate ChatGPT’s ability to generate accurate ‘Impression’ sections of radiology reports, and its effectiveness in evaluating these sections compared against human radiologist judgments. We utilized a dataset of CT-thorax radiology reports to fine-tune ChatGPT and then conducted a reader study with two radiologists and GPT-4 out-of-the-box to evaluate the AI-generated ‘Impression’ sections in comparison to the originals. The results revealed that human experts rated original impressions higher than AI-generated ones across correctness, completeness, and conciseness, highlighting a gap in the AI’s ability to generate clinically reliable medical text. Additionally, GPT-4’s evaluations were more favorable towards AI-generated content, indicating limitations in its out-of-the-box use as an evaluator in specialized domains. The study emphasizes the need for cautious integration of LLMs into medical domains and the importance of expert validation, yet also acknowledges the inherent subjectivity in interpreting and evaluating medical reports.
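The abstract describes a two-stage pipeline: fine-tuning a ChatGPT-class model on findings-to-impression pairs, then using out-of-the-box GPT-4 as an automatic rater. The sketch below illustrates how such a pipeline could be wired up with the OpenAI Python SDK; the model names, prompt wording, file name, and the 1–5 rating scale are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch of the two stages implied by the abstract:
# (1) fine-tune a ChatGPT-class model on findings -> impression pairs,
# (2) use out-of-the-box GPT-4 to rate an impression on one criterion.
# Prompts, model names, and the rating scale are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def write_training_file(pairs, path="impressions_train.jsonl"):
    """Serialize (findings, impression) pairs into OpenAI's chat JSONL format."""
    with open(path, "w", encoding="utf-8") as f:
        for findings, impression in pairs:
            example = {"messages": [
                {"role": "system", "content": "You are a radiologist. Write the "
                 "'Impression' section for this Dutch CT-thorax report, in Dutch."},
                {"role": "user", "content": findings},
                {"role": "assistant", "content": impression},
            ]}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def start_finetune(path="impressions_train.jsonl"):
    """Upload the training file and launch a fine-tuning job."""
    upload = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    return client.fine_tuning.jobs.create(training_file=upload.id,
                                          model="gpt-3.5-turbo")


def gpt4_rate(findings, impression, criterion):
    """Ask out-of-the-box GPT-4 to score an impression on one criterion (1-5)."""
    prompt = (f"Rate the following radiology 'Impression' for {criterion} "
              f"on a scale of 1 (poor) to 5 (excellent). "
              f"Reply with the number only.\n\n"
              f"Findings:\n{findings}\n\nImpression:\n{impression}")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```

In a reader study like the one described, the same criteria (correctness, completeness, conciseness) would be scored independently by the radiologists and by `gpt4_rate`, so that the human and GPT-4 ratings of original versus AI-generated impressions can be compared directly.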