Uncertainty Estimation in Large Vision Language Models for Automated Radiology Report Generation

Jenny Xu
Proceedings of the 4th Machine Learning for Health Symposium, PMLR 259:1039-1052, 2025.

Abstract

The automated generation of free-text radiology reports is crucial for improving diagnosis and treatment in clinical practice. The latest chest X-ray report generation models utilize large vision language model (LVLM) architectures, which demand a higher level of interpretability for clinical deployment. Uncertainty estimation scores can assist clinicians in evaluating the reliability of these model outputs and promoting broader adoption of automated systems. In this paper, we conduct a comprehensive evaluation of the correlation between 16 LLM uncertainty scores and 6 radiology report evaluation metrics across 4 state-of-the-art LVLMs for CXR report generation. Our findings show a strong Pearson correlation, ranging from 0.4 to 0.6 on a scale from -1 to 1, for several models. We provide a detailed analysis of these uncertainty scores and evaluation metrics, offering insights into applying these methods in real clinical settings. This study is the first to evaluate LLM-based uncertainty estimation scores for X-ray report generation LVLMs, establishing a benchmark and laying the groundwork for their adoption in clinical practice.

Cite this Paper


BibTeX
@InProceedings{pmlr-v259-xu25a,
  title = {Uncertainty Estimation in Large Vision Language Models for Automated Radiology Report Generation},
  author = {Xu, Jenny},
  booktitle = {Proceedings of the 4th Machine Learning for Health Symposium},
  pages = {1039--1052},
  year = {2025},
  editor = {Hegselmann, Stefan and Zhou, Helen and Healey, Elizabeth and Chang, Trenton and Ellington, Caleb and Mhasawade, Vishwali and Tonekaboni, Sana and Argaw, Peniel and Zhang, Haoran},
  volume = {259},
  series = {Proceedings of Machine Learning Research},
  month = {15--16 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v259/main/assets/xu25a/xu25a.pdf},
  url = {https://proceedings.mlr.press/v259/xu25a.html},
  abstract = {The automated generation of free-text radiology reports is crucial for improving diagnosis and treatment in clinical practice. The latest chest X-ray report generation models utilize large vision language model (LVLM) architectures, which demand a higher level of interpretability for clinical deployment. Uncertainty estimation scores can assist clinicians in evaluating the reliability of these model outputs and promoting broader adoption of automated systems. In this paper, we conduct a comprehensive evaluation of the correlation between 16 LLM uncertainty scores and 6 radiology report evaluation metrics across 4 state-of-the-art LVLMs for CXR report generation. Our findings show a strong Pearson correlation, ranging from 0.4 to 0.6 on a scale from -1 to 1, for several models. We provide a detailed analysis of these uncertainty scores and evaluation metrics, offering insights in applying these methods in real clinical settings. This study is the first to evaluate LLM-based uncertainty estimation scores for X-ray report generation LVLM models, establishing a benchmark and laying the groundwork for their adoption in clinical practice.}
}
Endnote
%0 Conference Paper
%T Uncertainty Estimation in Large Vision Language Models for Automated Radiology Report Generation
%A Jenny Xu
%B Proceedings of the 4th Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2025
%E Stefan Hegselmann
%E Helen Zhou
%E Elizabeth Healey
%E Trenton Chang
%E Caleb Ellington
%E Vishwali Mhasawade
%E Sana Tonekaboni
%E Peniel Argaw
%E Haoran Zhang
%F pmlr-v259-xu25a
%I PMLR
%P 1039--1052
%U https://proceedings.mlr.press/v259/xu25a.html
%V 259
%X The automated generation of free-text radiology reports is crucial for improving diagnosis and treatment in clinical practice. The latest chest X-ray report generation models utilize large vision language model (LVLM) architectures, which demand a higher level of interpretability for clinical deployment. Uncertainty estimation scores can assist clinicians in evaluating the reliability of these model outputs and promoting broader adoption of automated systems. In this paper, we conduct a comprehensive evaluation of the correlation between 16 LLM uncertainty scores and 6 radiology report evaluation metrics across 4 state-of-the-art LVLMs for CXR report generation. Our findings show a strong Pearson correlation, ranging from 0.4 to 0.6 on a scale from -1 to 1, for several models. We provide a detailed analysis of these uncertainty scores and evaluation metrics, offering insights in applying these methods in real clinical settings. This study is the first to evaluate LLM-based uncertainty estimation scores for X-ray report generation LVLM models, establishing a benchmark and laying the groundwork for their adoption in clinical practice.
APA
Xu, J. (2025). Uncertainty Estimation in Large Vision Language Models for Automated Radiology Report Generation. Proceedings of the 4th Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 259:1039-1052. Available from https://proceedings.mlr.press/v259/xu25a.html.
