TIER: Text-Image Entropy Regularization for Medical CLIP-style models
Proceedings of the 8th Machine Learning for Healthcare Conference, PMLR 219:548-564, 2023.
Abstract
In this paper, we introduce a novel regularization scheme for contrastive language-image pre-trained (CLIP) medical vision models. Our approach is based on the observation that, for many medical imaging tasks, text tokens should describe only a small number of image regions and, likewise, each image region should correspond to only a few text tokens. In CLIP-style models, this implies that text-token embeddings should have high similarity to only a small number of image-patch embeddings for a given image-text pair. We formalize this observation with a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores. We demonstrate, both qualitatively and quantitatively, that the proposed regularization improves localization by shrinking most of the pairwise text-token and image-patch similarity scores towards zero, achieving the desired sparsity. We demonstrate the promise of our approach in an important medical context, chest x-rays, where this underlying sparsity hypothesis naturally arises. Using the proposed approach, we achieve state-of-the-art (SOTA) average zero-shot performance on the CheXpert and PadChest chest x-ray datasets, outperforming an unregularized version of the model as well as several recently published self-supervised models.
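To make the regularizer concrete, below is a minimal PyTorch sketch of an entropy penalty on token-to-patch similarity scores, in the spirit of the abstract. The function name `tier_entropy_penalty`, the softmax normalization, the temperature value, and the loss weighting are illustrative assumptions; the paper's exact formulation of the entropy term may differ.

```python
import torch
import torch.nn.functional as F

def tier_entropy_penalty(text_tokens: torch.Tensor,
                         image_patches: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Entropy penalty on text-token -> image-patch similarity scores.

    text_tokens:   (B, T, d) text-token embeddings
    image_patches: (B, P, d) image-patch embeddings

    Hypothetical sketch: each token's similarities over the patches of
    its paired image are softmax-normalized, and the Shannon entropy of
    that distribution is penalized. Low entropy encourages each token to
    attend to only a few patches, pushing most scores towards zero.
    """
    # Cosine similarities via L2 normalization: (B, T, P)
    text_tokens = F.normalize(text_tokens, dim=-1)
    image_patches = F.normalize(image_patches, dim=-1)
    sim = torch.einsum("btd,bpd->btp", text_tokens, image_patches)

    # Per-token probability distribution over image patches
    probs = F.softmax(sim / temperature, dim=-1)

    # Shannon entropy of each token's patch distribution: (B, T)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    return entropy.mean()

# Illustrative usage: add the penalty to the usual CLIP contrastive loss,
# where `lam` is a regularization weight chosen on validation data.
# total_loss = clip_contrastive_loss + lam * tier_entropy_penalty(txt, img)
```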