CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

Aman Shrivastava; Ramprasaath R. Selvaraju; Nikhil Naik; Vicente Ordonez

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:8433-8447, 2023.

Abstract

We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite

Cite this Paper

BibTeX


@InProceedings{pmlr-v206-shrivastava23a,
  title = 	 {CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision},
  author =       {Shrivastava, Aman and Selvaraju, Ramprasaath R. and Naik, Nikhil and Ordonez, Vicente},
  booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {8433--8447},
  year = 	 {2023},
  editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume = 	 {206},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--27 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v206/shrivastava23a/shrivastava23a.pdf},
  url = 	 {https://proceedings.mlr.press/v206/shrivastava23a.html},
  abstract = 	 {We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0\% mAP absolute gain in performance on Pascal VOC classification, and a +22.1\% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite}
}

Endnote

%0 Conference Paper
%T CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision
%A Aman Shrivastava
%A Ramprasaath R. Selvaraju
%A Nikhil Naik
%A Vicente Ordonez
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-shrivastava23a
%I PMLR
%P 8433--8447
%U https://proceedings.mlr.press/v206/shrivastava23a.html
%V 206
%X We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite

APA


Shrivastava, A., Selvaraju, R.R., Naik, N. & Ordonez, V. (2023). CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:8433-8447. Available from https://proceedings.mlr.press/v206/shrivastava23a.html.
