Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models

Konstantinos Vilouras, Ilias Stogiannidis, Junyu Yan, Alison Q. Smithard, Sotirios A. Tsaftaris
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2864-2892, 2026.

Abstract

Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). We further validate our approach through a pilot qualitative study and an experiment on grounded disease classification.

Cite this Paper
BibTeX
@InProceedings{pmlr-v315-vilouras26a,
  title     = {Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models},
  author    = {Vilouras, Konstantinos and Stogiannidis, Ilias and Yan, Junyu and Smithard, Alison Q. and Tsaftaris, Sotirios A.},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {2864--2892},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/vilouras26a/vilouras26a.pdf},
  url       = {https://proceedings.mlr.press/v315/vilouras26a.html},
  abstract  = {Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). We further validate our approach through a pilot qualitative study and an experiment on grounded disease classification.}
}
Endnote
%0 Conference Paper
%T Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models
%A Konstantinos Vilouras
%A Ilias Stogiannidis
%A Junyu Yan
%A Alison Q. Smithard
%A Sotirios A. Tsaftaris
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-vilouras26a
%I PMLR
%P 2864--2892
%U https://proceedings.mlr.press/v315/vilouras26a.html
%V 315
%X Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). We further validate our approach through a pilot qualitative study and an experiment on grounded disease classification.
APA
Vilouras, K., Stogiannidis, I., Yan, J., Smithard, A.Q. & Tsaftaris, S.A. (2026). Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:2864-2892. Available from https://proceedings.mlr.press/v315/vilouras26a.html.

Related Material