Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark

Li Xu, Bo Liu, Ameer Hamza Khan, Lu Fan, Xiao-Ming Wu
Proceedings of the Conference on Health, Inference, and Learning, PMLR 209:117-132, 2023.

Abstract

With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven effective for various VL tasks such as visual question answering. However, studies on VLP in the medical domain remain scarce. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis of key factors that may affect the performance of VLP with a unified vision-language Transformer. To enable sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from MedPix, an open-access online database. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we derive several key insights to guide future medical VLP research and establish new strong baselines for various medical VL tasks.
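To make the setup concrete, below is a minimal, illustrative sketch (not the authors' code) of single-stream vision-language pre-training on image-caption pairs, in the spirit of the unified VL Transformer the abstract describes. All specifics here are assumptions for illustration: the model dimensions, the masked-language-modeling-style objective, and the use of pre-extracted patch features are not taken from the paper.

import torch
import torch.nn as nn

class UnifiedVLTransformer(nn.Module):
    """Toy unified encoder: image patch embeddings and caption token
    embeddings pass through a single shared Transformer stack."""

    def __init__(self, vocab_size=30522, dim=256, patches=196, max_len=64):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)          # project patch features to model dim
        self.tok_emb = nn.Embedding(vocab_size, dim)   # caption token embeddings
        self.pos_emb = nn.Parameter(torch.zeros(1, patches + max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(dim, vocab_size)      # per-token vocabulary logits

    def forward(self, patch_feats, token_ids):
        # Concatenate the visual and textual sequences into one stream.
        x = torch.cat([self.patch_proj(patch_feats), self.tok_emb(token_ids)], dim=1)
        x = x + self.pos_emb[:, : x.size(1)]
        h = self.encoder(x)
        text_h = h[:, patch_feats.size(1):]            # keep only caption positions
        return self.lm_head(text_h)

# One illustrative training step on a random stand-in batch; in practice the
# patch features and token ids would come from an image-caption dataset such as RGC.
model = UnifiedVLTransformer()
patch_feats = torch.randn(2, 196, 768)                 # stand-in ViT-style patch features
token_ids = torch.randint(0, 30522, (2, 64))           # stand-in caption token ids
labels = token_ids.clone()                             # predict the caption tokens

logits = model(patch_feats, token_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, 30522), labels.reshape(-1))
loss.backward()
print(float(loss))

The single shared encoder is the defining design choice of single-stream VLP models: cross-modal interaction happens in every attention layer rather than through separate image and text towers, which is one of the factors an empirical study like this one can vary.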

Cite this Paper


BibTeX
@InProceedings{pmlr-v209-xu23a,
  title     = {Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark},
  author    = {Xu, Li and Liu, Bo and Khan, Ameer Hamza and Fan, Lu and Wu, Xiao-Ming},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages     = {117--132},
  year      = {2023},
  editor    = {Mortazavi, Bobak J. and Sarker, Tasmie and Beam, Andrew and Ho, Joyce C.},
  volume    = {209},
  series    = {Proceedings of Machine Learning Research},
  month     = {22 Jun--24 Jun},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v209/xu23a/xu23a.pdf},
  url       = {https://proceedings.mlr.press/v209/xu23a.html}
}
Endnote
%0 Conference Paper
%T Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
%A Li Xu
%A Bo Liu
%A Ameer Hamza Khan
%A Lu Fan
%A Xiao-Ming Wu
%B Proceedings of the Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Bobak J. Mortazavi
%E Tasmie Sarker
%E Andrew Beam
%E Joyce C. Ho
%F pmlr-v209-xu23a
%I PMLR
%P 117--132
%U https://proceedings.mlr.press/v209/xu23a.html
%V 209
APA
Xu, L., Liu, B., Khan, A. H., Fan, L., & Wu, X.-M. (2023). Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark. Proceedings of the Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 209:117-132. Available from https://proceedings.mlr.press/v209/xu23a.html.