VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-training

Wangchunshu Zhou; Yan Zeng; Shizhe Diao; Xinsong Zhang

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:27395-27411, 2022.

Abstract

Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance in a range of vision-language (VL) tasks. However, there exist several challenges for measuring the community’s progress in building general multi-modal intelligence. First, most of the downstream VL datasets are annotated using raw images that are already seen during pre-training, which may result in an overestimation of current VLP models’ generalization ability. Second, recent VLP work mainly focuses on absolute performance but overlooks the efficiency-performance trade-off, which is also an important indicator for measuring progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off (“Pareto SOTA”) of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when testing on out-of-distribution test sets annotated on images from a more diverse distribution that spreads across cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models leads to complementary insights for several design choices of VLP. We release the VLUE benchmark to promote research on building vision-language models that generalize well to images unseen during pre-training and are practical in terms of efficiency-performance trade-off.

Cite this Paper

BibTeX

@InProceedings{pmlr-v162-zhou22n,
  title = 	 {{VLUE}: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-training},
  author =       {Zhou, Wangchunshu and Zeng, Yan and Diao, Shizhe and Zhang, Xinsong},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {27395--27411},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/zhou22n/zhou22n.pdf},
  url = 	 {https://proceedings.mlr.press/v162/zhou22n.html},
  abstract = 	 {Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance in a range of vision-language (VL) tasks. However, there exist several challenges for measuring the community’s progress in building general multi-modal intelligence. First, most of the downstream VL datasets are annotated using raw images that are already seen during pre-training, which may result in an overestimation of current VLP models’ generalization ability. Second, recent VLP work mainly focuses on absolute performance but overlooks the efficiency-performance trade-off, which is also an important indicator for measuring progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off (“Pareto SOTA”) of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when testing on out-of-distribution test sets annotated on images from a more diverse distribution that spreads across cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models leads to complementary insights for several design choices of VLP. We release the VLUE benchmark to promote research on building vision-language models that generalize well to images unseen during pre-training and are practical in terms of efficiency-performance trade-off.}
}

Endnote

%0 Conference Paper
%T VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-training
%A Wangchunshu Zhou
%A Yan Zeng
%A Shizhe Diao
%A Xinsong Zhang
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-zhou22n
%I PMLR
%P 27395--27411
%U https://proceedings.mlr.press/v162/zhou22n.html
%V 162
%X Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance in a range of vision-language (VL) tasks. However, there exist several challenges for measuring the community’s progress in building general multi-modal intelligence. First, most of the downstream VL datasets are annotated using raw images that are already seen during pre-training, which may result in an overestimation of current VLP models’ generalization ability. Second, recent VLP work mainly focuses on absolute performance but overlooks the efficiency-performance trade-off, which is also an important indicator for measuring progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off (“Pareto SOTA”) of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when testing on out-of-distribution test sets annotated on images from a more diverse distribution that spreads across cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models leads to complementary insights for several design choices of VLP. We release the VLUE benchmark to promote research on building vision-language models that generalize well to images unseen during pre-training and are practical in terms of efficiency-performance trade-off.

APA

Zhou, W., Zeng, Y., Diao, S., & Zhang, X. (2022). VLUE: A Multi-Task Multi-Dimension Benchmark for Evaluating Vision-Language Pre-training. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:27395-27411. Available from https://proceedings.mlr.press/v162/zhou22n.html.
