Does Grounding Improve Radiology Report Generation? An Empirical Study on PadChest-GR

Mohamed Aas-Alas, Alberto Albiol, Roberto Paredes
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:1375-1391, 2026.

Abstract

Radiology Report Generation (RRG) aims to automatically produce clinically accurate descriptions of medical images, yet current models often struggle with incomplete findings, generic phrasing, and hallucinations due to the absence of explicit grounding signals. To address these limitations, we propose a grounding-based RRG framework that integrates spatially localized visual evidence into the generation process. Our approach combines a ViT vision encoder with a GPT-2 language decoder through a lightweight transformer-based bridging module inspired by Bridge-Enhanced Vision Encoder–Decoder (VED) architectures. Grounding is introduced using bounding boxes of anatomical regions and pathologies, enabling the model to attend to both global and localized features. We further adopt the region-to-text task, where the model generates findings directly from specific regions of interest. Experiments on the PadChest-GR dataset demonstrate that grounding substantially improves linguistic quality and clinical accuracy, with the full image plus grounding mask configuration achieving the strongest gains across BLEU, ROUGE-L, CIDEr, BERTScore, CheXbert F1, and RadGraph F1. Analyses also show that even partial or noisy grounding yields consistent benefits.

Cite this Paper
BibTeX
@InProceedings{pmlr-v315-aas-alas26a,
  title     = {Does Grounding Improve Radiology Report Generation? An Empirical Study on PadChest-GR},
  author    = {Aas-Alas, Mohamed and Albiol, Alberto and Paredes, Roberto},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {1375--1391},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/aas-alas26a/aas-alas26a.pdf},
  url       = {https://proceedings.mlr.press/v315/aas-alas26a.html},
  abstract  = {Radiology Report Generation (RRG) aims to automatically produce clinically accurate descriptions of medical images, yet current models often struggle with incomplete findings, generic phrasing, and hallucinations due to the absence of explicit grounding signals. To address these limitations, we propose a grounding-based RRG framework that integrates spatially localized visual evidence into the generation process. Our approach combines a ViT vision encoder with a GPT-2 language decoder through a lightweight transformer-based bridging module inspired by Bridge-Enhanced Vision Encoder–Decoder (VED) architectures. Grounding is introduced using bounding boxes of anatomical regions and pathologies, enabling the model to attend to both global and localized features. We further adopt the region-to-text task, where the model generates findings directly from specific regions of interest. Experiments on the PadChest-GR dataset demonstrate that grounding substantially improves linguistic quality and clinical accuracy, with the full image plus grounding mask configuration achieving the strongest gains across BLEU, ROUGE-L, CIDEr, BERTScore, CheXbert F1, and RadGraph F1. Analyses also show that even partial or noisy grounding yields consistent benefits.}
}
Endnote
%0 Conference Paper
%T Does Grounding Improve Radiology Report Generation? An Empirical Study on PadChest-GR
%A Mohamed Aas-Alas
%A Alberto Albiol
%A Roberto Paredes
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-aas-alas26a
%I PMLR
%P 1375--1391
%U https://proceedings.mlr.press/v315/aas-alas26a.html
%V 315
%X Radiology Report Generation (RRG) aims to automatically produce clinically accurate descriptions of medical images, yet current models often struggle with incomplete findings, generic phrasing, and hallucinations due to the absence of explicit grounding signals. To address these limitations, we propose a grounding-based RRG framework that integrates spatially localized visual evidence into the generation process. Our approach combines a ViT vision encoder with a GPT-2 language decoder through a lightweight transformer-based bridging module inspired by Bridge-Enhanced Vision Encoder–Decoder (VED) architectures. Grounding is introduced using bounding boxes of anatomical regions and pathologies, enabling the model to attend to both global and localized features. We further adopt the region-to-text task, where the model generates findings directly from specific regions of interest. Experiments on the PadChest-GR dataset demonstrate that grounding substantially improves linguistic quality and clinical accuracy, with the full image plus grounding mask configuration achieving the strongest gains across BLEU, ROUGE-L, CIDEr, BERTScore, CheXbert F1, and RadGraph F1. Analyses also show that even partial or noisy grounding yields consistent benefits.
APA
Aas-Alas, M., Albiol, A. & Paredes, R. (2026). Does Grounding Improve Radiology Report Generation? An Empirical Study on PadChest-GR. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:1375-1391. Available from https://proceedings.mlr.press/v315/aas-alas26a.html.