How (not) to ensemble LVLMs for VQA

Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink
Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, PMLR 239:1-20, 2023.

Abstract

This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment (Fig. 1) shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
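
Read concretely, the oracle experiment scores a question as correct if at least one ensemble member answers it correctly, which yields an upper bound on what any answer-selection strategy could achieve. The following minimal Python sketch computes that bound; the model names and correctness vectors are hypothetical placeholders for illustration, not the paper's actual models or numbers.

from itertools import combinations

# Hypothetical per-model correctness: correct[m][q] is True iff model m
# answered question q correctly (illustrative data, not the paper's results).
correct = {
    "vanilla_lvlm":   [True,  False, False, True],
    "caption_lvlm":   [False, True,  False, True],
    "lens_retrieval": [False, False, True,  False],
}

def oracle_accuracy(members):
    # Oracle upper bound: a question counts if ANY member gets it right.
    hits = [any(answers) for answers in zip(*(correct[m] for m in members))]
    return sum(hits) / len(hits)

best_single = max(oracle_accuracy([m]) for m in correct)
best_ensemble = max(
    oracle_accuracy(subset)
    for r in range(1, len(correct) + 1)
    for subset in combinations(correct, r)
)
print(f"best single model: {best_single:.1%}")
print(f"oracle ensemble:   {best_ensemble:.1%}")

The gap between the two numbers is the headroom the abstract describes (48.8% vs. 67% in the paper); realizing it requires a selection rule that picks the right member per question, which is where the actual difficulty lies.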

Cite this Paper


BibTeX
@InProceedings{pmlr-v239-alazraki23a,
  title     = {How (not) to ensemble LVLMs for VQA},
  author    = {Alazraki, Lisa and Castrejon, Lluis and Dehghani, Mostafa and Huot, Fantine and Uijlings, Jasper and Mensink, Thomas},
  booktitle = {Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops},
  pages     = {1--20},
  year      = {2023},
  editor    = {Antorán, Javier and Blaas, Arno and Buchanan, Kelly and Feng, Fan and Fortuin, Vincent and Ghalebikesabi, Sahra and Kriegler, Andreas and Mason, Ian and Rohde, David and Ruiz, Francisco J. R. and Uelwer, Tobias and Xie, Yubin and Yang, Rui},
  volume    = {239},
  series    = {Proceedings of Machine Learning Research},
  month     = {16 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v239/alazraki23a/alazraki23a.pdf},
  url       = {https://proceedings.mlr.press/v239/alazraki23a.html},
  abstract  = {This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment (Fig. 1) shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?}
}
APA
Alazraki, L., Castrejon, L., Dehghani, M., Huot, F., Uijlings, J. & Mensink, T. (2023). How (not) to ensemble LVLMs for VQA. Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, in Proceedings of Machine Learning Research 239:1-20. Available from https://proceedings.mlr.press/v239/alazraki23a.html.
