Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Jinxia Yang; Bing Su; Xin Zhao; Ji-Rong Wen

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:56382-56396, 2024.

Abstract

Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-yang24v,
  title = 	 {Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training},
  author =       {Yang, Jinxia and Su, Bing and Zhao, Xin and Wen, Ji-Rong},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {56382--56396},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/yang24v/yang24v.pdf},
  url = 	 {https://proceedings.mlr.press/v235/yang24v.html},
  abstract = 	 {Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST.}
}

Endnote

%0 Conference Paper
%T Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training
%A Jinxia Yang
%A Bing Su
%A Xin Zhao
%A Ji-Rong Wen
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-yang24v
%I PMLR
%P 56382--56396
%U https://proceedings.mlr.press/v235/yang24v.html
%V 235
%X Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST.

APA


Yang, J., Su, B., Zhao, X., & Wen, J.-R. (2024). Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:56382-56396. Available from https://proceedings.mlr.press/v235/yang24v.html.
