Diffusion-based Visual Representation Learning for Medical Question Answering

Dexin Bian, Xiaoru Wang, Meifang Li
Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:169-184, 2024.

Abstract

Medical visual question answering (Med-VQA) aims to correctly answer a medical question based on a given image. One of the major challenges is the scarcity of large, professionally labeled datasets for training, which poses obstacles to feature extraction, especially for medical images. To overcome this, we propose a method to learn transferable visual representations based on a conditional denoising diffusion probabilistic model (conditional DDPM). Specifically, we collate a large amount of unlabeled radiological images and train a conditional DDPM in an auto-encoder paradigm to obtain a model that can extract high-level semantic information from medical images. The pre-trained model can serve as a well-initialized visual feature extractor and can be easily adapted to any Med-VQA system. We build our Med-VQA system following the state-of-the-art Med-VQA architecture and replace the visual extractor with our pre-trained model. Our proposed method outperforms the state-of-the-art Med-VQA method on VQA-RAD and achieves comparable results on SLAKE.
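The pretraining idea the abstract describes — corrupt an image with Gaussian noise, then train a denoiser conditioned on a latent code from an encoder, so the encoder learns reusable semantic features — can be sketched in toy form. This is a minimal NumPy illustration under assumed shapes and linear stand-ins for the networks; the names (`encode`, `predict_noise`, `ddpm_training_loss`) are hypothetical, not the authors' code.

```python
# Toy sketch of conditional-DDPM pretraining as described in the abstract.
# All networks are random linear maps standing in for real models.
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_LATENT, T = 64, 16, 1000  # assumed toy dimensions / diffusion steps

# Stand-in "encoder": its output z is the condition during pretraining and,
# after pretraining, the visual feature handed to the Med-VQA system.
W_enc = rng.normal(scale=0.1, size=(D_LATENT, D_IMG))

def encode(x):
    """Map an image vector to a high-level latent condition z."""
    return W_enc @ x

# Stand-in "denoiser": predicts the added noise from (noisy image, t, z).
W_den = rng.normal(scale=0.1, size=(D_IMG, D_IMG + D_LATENT + 1))

def predict_noise(x_t, t, z):
    return W_den @ np.concatenate([x_t, z, [t / T]])

def ddpm_training_loss(x0, t, alpha_bar):
    """One conditional-DDPM training step: diffuse x0 to noise level t,
    then score the denoiser's noise prediction with an MSE loss."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    z = encode(x0)  # condition on the clean image's latent code
    return float(np.mean((predict_noise(x_t, t, z) - eps) ** 2))

# Linear beta schedule -> cumulative alpha_bar, as in the standard DDPM setup.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.normal(size=D_IMG)  # stand-in for an unlabeled radiological image
loss = ddpm_training_loss(x0, t=500, alpha_bar=alpha_bar)
```

In the paper's setting, minimizing this loss over unlabeled radiological images forces the latent z to carry the semantic content needed to reconstruct the image, after which the encoder replaces the visual extractor in the Med-VQA pipeline.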

Cite this Paper


BibTeX
@InProceedings{pmlr-v222-bian24a,
  title     = {Diffusion-based Visual Representation Learning for Medical Question Answering},
  author    = {Bian, Dexin and Wang, Xiaoru and Li, Meifang},
  booktitle = {Proceedings of the 15th Asian Conference on Machine Learning},
  pages     = {169--184},
  year      = {2024},
  editor    = {Yanıkoğlu, Berrin and Buntine, Wray},
  volume    = {222},
  series    = {Proceedings of Machine Learning Research},
  month     = {11--14 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v222/bian24a/bian24a.pdf},
  url       = {https://proceedings.mlr.press/v222/bian24a.html},
  abstract  = {Medical visual question answering (Med-VQA) aims to correctly answer a medical question based on a given image. One of the major challenges is the scarcity of large, professionally labeled datasets for training, which poses obstacles to feature extraction, especially for medical images. To overcome this, we propose a method to learn transferable visual representations based on a conditional denoising diffusion probabilistic model (conditional DDPM). Specifically, we collate a large amount of unlabeled radiological images and train a conditional DDPM in an auto-encoder paradigm to obtain a model that can extract high-level semantic information from medical images. The pre-trained model can serve as a well-initialized visual feature extractor and can be easily adapted to any Med-VQA system. We build our Med-VQA system following the state-of-the-art Med-VQA architecture and replace the visual extractor with our pre-trained model. Our proposed method outperforms the state-of-the-art Med-VQA method on VQA-RAD and achieves comparable results on SLAKE.}
}
Endnote
%0 Conference Paper
%T Diffusion-based Visual Representation Learning for Medical Question Answering
%A Dexin Bian
%A Xiaoru Wang
%A Meifang Li
%B Proceedings of the 15th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Berrin Yanıkoğlu
%E Wray Buntine
%F pmlr-v222-bian24a
%I PMLR
%P 169--184
%U https://proceedings.mlr.press/v222/bian24a.html
%V 222
%X Medical visual question answering (Med-VQA) aims to correctly answer a medical question based on a given image. One of the major challenges is the scarcity of large, professionally labeled datasets for training, which poses obstacles to feature extraction, especially for medical images. To overcome this, we propose a method to learn transferable visual representations based on a conditional denoising diffusion probabilistic model (conditional DDPM). Specifically, we collate a large amount of unlabeled radiological images and train a conditional DDPM in an auto-encoder paradigm to obtain a model that can extract high-level semantic information from medical images. The pre-trained model can serve as a well-initialized visual feature extractor and can be easily adapted to any Med-VQA system. We build our Med-VQA system following the state-of-the-art Med-VQA architecture and replace the visual extractor with our pre-trained model. Our proposed method outperforms the state-of-the-art Med-VQA method on VQA-RAD and achieves comparable results on SLAKE.
APA
Bian, D., Wang, X., & Li, M. (2024). Diffusion-based Visual Representation Learning for Medical Question Answering. Proceedings of the 15th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 222:169-184. Available from https://proceedings.mlr.press/v222/bian24a.html.