Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime

Rhydian Windsor; Amir Jamaludin; Timor Kadir; Andrew Zisserman

Medical Imaging with Deep Learning, PMLR 227:53-73, 2024.

Abstract

This paper explores training medical vision-language models (VLMs) – where the visual and language inputs are embedded into a common space – with a particular focus on scenarios where training data is limited, as is often the case in clinical datasets. We explore several candidate methods to improve low-data performance, including: (i) adapting generic pre-trained models to novel image and text domains (i.e. medical imaging and reports) via unimodal self-supervision; (ii) using local (e.g. GLoRIA) and global (e.g. InfoNCE) contrastive loss functions, as well as a combination of the two; (iii) adding extra supervision during VLM training, via: (a) image- and text-only self-supervision, and (b) creating additional positive image-text pairs for training through augmentation and nearest-neighbour search. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports. Combined, they significantly improve retrieval compared to fine-tuning CLIP, an improvement roughly equivalent to training with $10\times$ the data. A similar pattern is found in the downstream task of classifying CXR-related conditions, with our method outperforming CLIP and also BioVIL, a strong CXR VLM benchmark, in the zero-shot and linear probing settings. We conclude with a set of recommendations for researchers aiming to train vision-language models on other medical imaging modalities when training data is scarce. To facilitate further research, we will make our code and models publicly available.

Cite this Paper

BibTeX


@InProceedings{pmlr-v227-windsor24a,
  title = 	 {Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime},
  author =       {Windsor, Rhydian and Jamaludin, Amir and Kadir, Timor and Zisserman, Andrew},
  booktitle = 	 {Medical Imaging with Deep Learning},
  pages = 	 {53--73},
  year = 	 {2024},
  editor = 	 {Oguz, Ipek and Noble, Jack and Li, Xiaoxiao and Styner, Martin and Baumgartner, Christian and Rusu, Mirabela and Heimann, Tobias and Kontos, Despina and Landman, Bennett and Dawant, Benoit},
  volume = 	 {227},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--12 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v227/windsor24a/windsor24a.pdf},
  url = 	 {https://proceedings.mlr.press/v227/windsor24a.html},
  abstract = 	 {This paper explores training medical vision-language models (VLMs) – where the visual and language inputs are embedded into a common space – with a particular focus on scenarios where training data is limited, as is often the case in clinical datasets. We explore several candidate methods to improve low-data performance, including: (i) adapting generic pre-trained models to novel image and text domains (i.e. medical imaging and reports) via unimodal self-supervision; (ii) using local (e.g. GLoRIA) and global (e.g. InfoNCE) contrastive loss functions, as well as a combination of the two; (iii) adding extra supervision during VLM training, via: (a) image- and text-only self-supervision, and (b) creating additional positive image-text pairs for training through augmentation and nearest-neighbour search. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports. Combined, they significantly improve retrieval compared to fine-tuning CLIP, an improvement roughly equivalent to training with $10\times$ the data. A similar pattern is found in the downstream task of classifying CXR-related conditions, with our method outperforming CLIP and also BioVIL, a strong CXR VLM benchmark, in the zero-shot and linear probing settings. We conclude with a set of recommendations for researchers aiming to train vision-language models on other medical imaging modalities when training data is scarce. To facilitate further research, we will make our code and models publicly available.}
}

Endnote

%0 Conference Paper
%T Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime
%A Rhydian Windsor
%A Amir Jamaludin
%A Timor Kadir
%A Andrew Zisserman
%B Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ipek Oguz
%E Jack Noble
%E Xiaoxiao Li
%E Martin Styner
%E Christian Baumgartner
%E Mirabela Rusu
%E Tobias Heimann
%E Despina Kontos
%E Bennett Landman
%E Benoit Dawant
%F pmlr-v227-windsor24a
%I PMLR
%P 53--73
%U https://proceedings.mlr.press/v227/windsor24a.html
%V 227
%X This paper explores training medical vision-language models (VLMs) – where the visual and language inputs are embedded into a common space – with a particular focus on scenarios where training data is limited, as is often the case in clinical datasets. We explore several candidate methods to improve low-data performance, including: (i) adapting generic pre-trained models to novel image and text domains (i.e. medical imaging and reports) via unimodal self-supervision; (ii) using local (e.g. GLoRIA) and global (e.g. InfoNCE) contrastive loss functions, as well as a combination of the two; (iii) adding extra supervision during VLM training, via: (a) image- and text-only self-supervision, and (b) creating additional positive image-text pairs for training through augmentation and nearest-neighbour search. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports. Combined, they significantly improve retrieval compared to fine-tuning CLIP, an improvement roughly equivalent to training with $10\times$ the data. A similar pattern is found in the downstream task of classifying CXR-related conditions, with our method outperforming CLIP and also BioVIL, a strong CXR VLM benchmark, in the zero-shot and linear probing settings. We conclude with a set of recommendations for researchers aiming to train vision-language models on other medical imaging modalities when training data is scarce. To facilitate further research, we will make our code and models publicly available.

APA


Windsor, R., Jamaludin, A., Kadir, T. & Zisserman, A. (2024). Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime. Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 227:53-73. Available from https://proceedings.mlr.press/v227/windsor24a.html.
