Contrastive Learning of Medical Visual Representations from Paired Images and Text

Yuhao Zhang; Hang Jiang; Yasuhide Miura; Christopher D. Manning; Curtis P. Langlotz

Contrastive Learning of Medical Visual Representations from Paired Images and Text

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, Curtis P. Langlotz

Proceedings of the 7th Machine Learning for Healthcare Conference, PMLR 182:2-25, 2022.

Abstract

Learning visual representations of medical images (e.g., X-rays) is core to medical image understanding but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies show exciting results from unsupervised contrastive learning from natural images, but we find these methods help little on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text. Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test ConVIRT by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.

Cite this Paper

BibTeX


@InProceedings{pmlr-v182-zhang22a,
  title = 	 {Contrastive Learning of Medical Visual Representations from Paired Images and Text},
  author =       {Zhang, Yuhao and Jiang, Hang and Miura, Yasuhide and Manning, Christopher D. and Langlotz, Curtis P.},
  booktitle = 	 {Proceedings of the 7th Machine Learning for Healthcare Conference},
  pages = 	 {2--25},
  year = 	 {2022},
  editor = 	 {Lipton, Zachary and Ranganath, Rajesh and Sendak, Mark and Sjoding, Michael and Yeung, Serena},
  volume = 	 {182},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {05--06 Aug},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v182/zhang22a/zhang22a.pdf},
  url = 	 {https://proceedings.mlr.press/v182/zhang22a.html},
  abstract = 	 {Learning visual representations of medical images (e.g., X-rays) is core to medical image understanding but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies show exciting results from unsupervised contrastive learning from natural images, but we find these methods help little on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text. Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test ConVIRT by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.}
}

Endnote

%0 Conference Paper
%T Contrastive Learning of Medical Visual Representations from Paired Images and Text
%A Yuhao Zhang
%A Hang Jiang
%A Yasuhide Miura
%A Christopher D. Manning
%A Curtis P. Langlotz
%B Proceedings of the 7th Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2022
%E Zachary Lipton
%E Rajesh Ranganath
%E Mark Sendak
%E Michael Sjoding
%E Serena Yeung	
%F pmlr-v182-zhang22a
%I PMLR
%P 2--25
%U https://proceedings.mlr.press/v182/zhang22a.html
%V 182
%X Learning visual representations of medical images (e.g., X-rays) is core to medical image understanding but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies show exciting results from unsupervised contrastive learning from natural images, but we find these methods help little on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text. Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test ConVIRT by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.

APA


Zhang, Y., Jiang, H., Miura, Y., Manning, C.D. & Langlotz, C.P.. (2022). Contrastive Learning of Medical Visual Representations from Paired Images and Text. Proceedings of the 7th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 182:2-25 Available from https://proceedings.mlr.press/v182/zhang22a.html.

Related Material

Download PDF