Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:26070-26084, 2024.

Abstract

There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector, limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip’s vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines such as CLIP and SigLIP on a variety of tasks, even with large-scale encoders. With a ViT-G/14 encoder, Llip improves zero-shot classification by an average of 2.9% across zero-shot classification benchmarks. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate a 6.0% improvement in zero-shot retrieval on MS-COCO. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.
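
The abstract describes the core mechanism: the vision encoder emits a set of visual features that are mixed into a single embedding by conditioning on information derived from the caption. Below is a minimal sketch of one way such text-conditioned pooling could look in PyTorch; it is not the authors' implementation, and the module names, tensor shapes, and single-head cross-attention are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): pool K visual "mixture"
# tokens into one embedding using a query derived from the caption feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionConditionedPooling(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # query from the pooled text feature
        self.k_proj = nn.Linear(dim, dim)  # keys from the visual mixture tokens
        self.v_proj = nn.Linear(dim, dim)  # values from the visual mixture tokens
        self.scale = dim ** -0.5

    def forward(self, visual_tokens: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, K, dim); text_feat: (batch, dim)
        q = self.q_proj(text_feat).unsqueeze(1)              # (batch, 1, dim)
        k = self.k_proj(visual_tokens)                       # (batch, K, dim)
        v = self.v_proj(visual_tokens)                       # (batch, K, dim)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, 1, K)
        mixed = (attn @ v).squeeze(1)                        # (batch, dim)
        return F.normalize(mixed, dim=-1)                    # unit-norm for a contrastive loss

# Usage: the caption-conditioned image embedding is scored against the caption
# embedding with a CLIP-style similarity, here shown with random tensors.
pooling = CaptionConditionedPooling()
visual_tokens = torch.randn(4, 32, 512)                      # K = 32 mixture tokens (assumed)
text_feat = torch.randn(4, 512)
image_embed = pooling(visual_tokens, text_feat)              # (4, 512)
logits = image_embed @ F.normalize(text_feat, dim=-1).T      # (4, 4) similarity matrix
```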

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-lavoie24a,
  title     = {Modeling Caption Diversity in Contrastive Vision-Language Pretraining},
  author    = {Lavoie, Samuel and Kirichenko, Polina and Ibrahim, Mark and Assran, Mido and Wilson, Andrew Gordon and Courville, Aaron and Ballas, Nicolas},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {26070--26084},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/lavoie24a/lavoie24a.pdf},
  url       = {https://proceedings.mlr.press/v235/lavoie24a.html}
}
Endnote
%0 Conference Paper
%T Modeling Caption Diversity in Contrastive Vision-Language Pretraining
%A Samuel Lavoie
%A Polina Kirichenko
%A Mark Ibrahim
%A Mido Assran
%A Andrew Gordon Wilson
%A Aaron Courville
%A Nicolas Ballas
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-lavoie24a
%I PMLR
%P 26070--26084
%U https://proceedings.mlr.press/v235/lavoie24a.html
%V 235
APA
Lavoie, S., Kirichenko, P., Ibrahim, M., Assran, M., Wilson, A. G., Courville, A. & Ballas, N. (2024). Modeling Caption Diversity in Contrastive Vision-Language Pretraining. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:26070-26084. Available from https://proceedings.mlr.press/v235/lavoie24a.html.