The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models

Chenwei Wu, Li Erran Li, Stefano Ermon, Patrick Haffner, Rong Ge, Zaiwei Zhang
Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, PMLR 239:118-126, 2023.

Abstract

Compositionality is a common property in many modalities including text and images, but the compositional generalization of multi-modal models is not well understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image, as the strength of the language model in detecting sentences that are syntactically and semantically likely overwhelms the vision part of the model. We find in particular that a benchmark for compositionality mostly favors pure language models. Finally, we propose a new benchmark for compositionality without such linguistic priors.
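
To make the abstract's central claim concrete, here is a minimal sketch (not from the paper) of the "blind" test it implies: scoring a benchmark's caption pairs with a text-only language model that never sees the image. The model choice (GPT-2 via the HuggingFace transformers library) and the example caption pair are illustrative assumptions; if average token likelihood alone separates correct captions from their perturbed negatives, the benchmark is measuring linguistic priors rather than vision-language grounding.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Text-only language model; it never receives the image.
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def avg_token_nll(caption: str) -> float:
        """Average negative log-likelihood of a caption; lower = more plausible as text."""
        inputs = tokenizer(caption, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, labels=inputs["input_ids"])
        return out.loss.item()

    # Hypothetical benchmark pair: the negative swaps the argument structure,
    # producing a sentence that is semantically unlikely on its own.
    correct = "the horse is eating the grass"
    negative = "the grass is eating the horse"

    # A blind model picks whichever caption is more plausible as pure text.
    # If this reliably matches the benchmark's labels, the vision encoder
    # is not needed to score well.
    blind_pick = min([correct, negative], key=avg_token_nll)
    print(blind_pick)  # expected: "the horse is eating the grass"

A benchmark of the kind the paper proposes would instead use negatives that remain equally plausible as text (for example, swapping two attributes between objects), so that only the image can break the tie.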

Cite this Paper


BibTeX
@InProceedings{pmlr-v239-wu23a,
  title     = {The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models},
  author    = {Wu, Chenwei and Li, Li Erran and Ermon, Stefano and Haffner, Patrick and Ge, Rong and Zhang, Zaiwei},
  booktitle = {Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops},
  pages     = {118--126},
  year      = {2023},
  editor    = {Antorán, Javier and Blaas, Arno and Buchanan, Kelly and Feng, Fan and Fortuin, Vincent and Ghalebikesabi, Sahra and Kriegler, Andreas and Mason, Ian and Rohde, David and Ruiz, Francisco J. R. and Uelwer, Tobias and Xie, Yubin and Yang, Rui},
  volume    = {239},
  series    = {Proceedings of Machine Learning Research},
  month     = {16 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v239/wu23a/wu23a.pdf},
  url       = {https://proceedings.mlr.press/v239/wu23a.html},
  abstract  = {Compositionality is a common property in many modalities including text and images, but the compositional generalization of multi-modal models is not well understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image, as the strength of the language model in detecting sentences that are syntactically and semantically likely overwhelms the vision part of the model. We find in particular that a benchmark for compositionality mostly favors pure language models. Finally, we propose a new benchmark for compositionality without such linguistic priors.}
}
Endnote
%0 Conference Paper
%T The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models
%A Chenwei Wu
%A Li Erran Li
%A Stefano Ermon
%A Patrick Haffner
%A Rong Ge
%A Zaiwei Zhang
%B Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops
%C Proceedings of Machine Learning Research
%D 2023
%E Javier Antorán
%E Arno Blaas
%E Kelly Buchanan
%E Fan Feng
%E Vincent Fortuin
%E Sahra Ghalebikesabi
%E Andreas Kriegler
%E Ian Mason
%E David Rohde
%E Francisco J. R. Ruiz
%E Tobias Uelwer
%E Yubin Xie
%E Rui Yang
%F pmlr-v239-wu23a
%I PMLR
%P 118--126
%U https://proceedings.mlr.press/v239/wu23a.html
%V 239
%X Compositionality is a common property in many modalities including text and images, but the compositional generalization of multi-modal models is not well understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image, as the strength of the language model in detecting sentences that are syntactically and semantically likely overwhelms the vision part of the model. We find in particular that a benchmark for compositionality mostly favors pure language models. Finally, we propose a new benchmark for compositionality without such linguistic priors.
APA
Wu, C., Li, L.E., Ermon, S., Haffner, P., Ge, R. & Zhang, Z. (2023). The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models. Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, in Proceedings of Machine Learning Research 239:118-126. Available from https://proceedings.mlr.press/v239/wu23a.html.