Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark

Li Xu; Bo Liu; Ameer Hamza Khan; Lu Fan; Xiao-Ming Wu

Proceedings of the Conference on Health, Inference, and Learning, PMLR 209:117-132, 2023.

Abstract

With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and proven to be effective for various VL tasks such as visual-question answering. However, studies on VLP in the medical domain have so far been scanty. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis to study key factors that may affect the performance of VLP with a unified vision-language Transformer. To allow making sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from an open-access online database MedPix. RGC can be used as a pre-training dataset or a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we develop several key insights that can guide future medical VLP research and new strong baselines for various medical VL tasks.

Cite this Paper

BibTeX


@InProceedings{pmlr-v209-xu23a,
  title = 	 {Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark},
  author =       {Xu, Li and Liu, Bo and Khan, Ameer Hamza and Fan, Lu and Wu, Xiao-Ming},
  booktitle = 	 {Proceedings of the Conference on Health, Inference, and Learning},
  pages = 	 {117--132},
  year = 	 {2023},
  editor = 	 {Mortazavi, Bobak J. and Sarker, Tasmie and Beam, Andrew and Ho, Joyce C.},
  volume = 	 {209},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {22 Jun--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v209/xu23a/xu23a.pdf},
  url = 	 {https://proceedings.mlr.press/v209/xu23a.html},
  abstract = 	 {With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and proven to be effective for various VL tasks such as visual-question answering. However, studies on VLP in the medical domain have so far been scanty. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis to study key factors that may affect the performance of VLP with a unified vision-language Transformer. To allow making sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from an open-access online database MedPix. RGC can be used as a pre-training dataset or a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we develop several key insights that can guide future medical VLP research and new strong baselines for various medical VL tasks.}
}

Endnote

%0 Conference Paper
%T Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
%A Li Xu
%A Bo Liu
%A Ameer Hamza Khan
%A Lu Fan
%A Xiao-Ming Wu
%B Proceedings of the Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Bobak J. Mortazavi
%E Tasmie Sarker
%E Andrew Beam
%E Joyce C. Ho
%F pmlr-v209-xu23a
%I PMLR
%P 117--132
%U https://proceedings.mlr.press/v209/xu23a.html
%V 209
%X With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and proven to be effective for various VL tasks such as visual-question answering. However, studies on VLP in the medical domain have so far been scanty. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis to study key factors that may affect the performance of VLP with a unified vision-language Transformer. To allow making sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from an open-access online database MedPix. RGC can be used as a pre-training dataset or a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we develop several key insights that can guide future medical VLP research and new strong baselines for various medical VL tasks.

APA


Xu, L., Liu, B., Khan, A.H., Fan, L. & Wu, X. (2023). Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark. Proceedings of the Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 209:117-132. Available from https://proceedings.mlr.press/v209/xu23a.html.
