Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:6216-6234, 2022.

Abstract

Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.
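One of the factors studied, (v) the contrastive loss function, refers to the CLIP-style symmetric image-text objective. As an illustrative sketch only (not the authors' code; the function name, embedding dimension, and temperature below are assumptions), a minimal PyTorch version of such a loss looks like this:

# Minimal sketch of a CLIP-style symmetric contrastive loss (InfoNCE),
# assuming pre-computed image and text embeddings; illustrative only.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by a temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th caption within the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example usage with random embeddings for a batch of 8 image-caption pairs.
images = torch.randn(8, 512)
captions = torch.randn(8, 512)
loss = clip_contrastive_loss(images, captions)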

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-fang22a,
  title     = {Data Determines Distributional Robustness in Contrastive Language Image Pre-training ({CLIP})},
  author    = {Fang, Alex and Ilharco, Gabriel and Wortsman, Mitchell and Wan, Yuhao and Shankar, Vaishaal and Dave, Achal and Schmidt, Ludwig},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {6216--6234},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/fang22a/fang22a.pdf},
  url       = {https://proceedings.mlr.press/v162/fang22a.html},
  abstract  = {Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.}
}
Endnote
%0 Conference Paper
%T Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
%A Alex Fang
%A Gabriel Ilharco
%A Mitchell Wortsman
%A Yuhao Wan
%A Vaishaal Shankar
%A Achal Dave
%A Ludwig Schmidt
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-fang22a
%I PMLR
%P 6216--6234
%U https://proceedings.mlr.press/v162/fang22a.html
%V 162
%X Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.
APA
Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A. & Schmidt, L. (2022). Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP). Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:6216-6234. Available from https://proceedings.mlr.press/v162/fang22a.html.
