MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:57116-57198, 2024.

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks testing rudimentary capabilities, and thus fall short of tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, and reasoning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $20$ publicly available LVLMs, including the proprietary GeminiProVision model, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
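
The evaluation described above follows a standard multiple-choice protocol: render each question and its options into a prompt, ask the model for an option letter, and aggregate accuracy per subtask. The Python sketch below illustrates that loop; the record fields (image, question, choices, answer, subtask) and the model.predict interface are illustrative assumptions, not MMT-Bench's actual data schema or evaluation code.

from collections import defaultdict

def format_prompt(question: str, choices: list[str]) -> str:
    # Render the options as "A. ...", "B. ...", etc., and ask for a letter.
    letters = "ABCDEFGH"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with the option letter only."

def evaluate(model, samples):
    # Accumulate per-subtask accuracy over multi-choice visual questions.
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        prompt = format_prompt(s["question"], s["choices"])
        # model.predict is a hypothetical interface: image + prompt -> text.
        reply = model.predict(image=s["image"], prompt=prompt)
        pred = reply.strip()[:1].upper()  # take the leading option letter
        total[s["subtask"]] += 1
        correct[s["subtask"]] += int(pred == s["answer"])
    return {task: correct[task] / total[task] for task in total}

Per-subtask scores computed this way can then be averaged within each of the $32$ meta-tasks, which is how a task-level comparison across models would naturally be reported.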

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-ying24a,
  title     = {{MMT}-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask {AGI}},
  author    = {Ying, Kaining and Meng, Fanqing and Wang, Jin and Li, Zhiqian and Lin, Han and Yang, Yue and Zhang, Hao and Zhang, Wenbo and Lin, Yuqi and Liu, Shuo and Lei, Jiayi and Lu, Quanfeng and Chen, Runjian and Xu, Peng and Zhang, Renrui and Zhang, Haozhe and Gao, Peng and Wang, Yali and Qiao, Yu and Luo, Ping and Zhang, Kaipeng and Shao, Wenqi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {57116--57198},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/ying24a/ying24a.pdf},
  url       = {https://proceedings.mlr.press/v235/ying24a.html},
  abstract  = {Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks testing rudimentary capabilities, and thus fall short of tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, and reasoning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $20$ publicly available LVLMs, including the proprietary GeminiProVision model, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.}
}
Endnote
%0 Conference Paper
%T MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
%A Kaining Ying
%A Fanqing Meng
%A Jin Wang
%A Zhiqian Li
%A Han Lin
%A Yue Yang
%A Hao Zhang
%A Wenbo Zhang
%A Yuqi Lin
%A Shuo Liu
%A Jiayi Lei
%A Quanfeng Lu
%A Runjian Chen
%A Peng Xu
%A Renrui Zhang
%A Haozhe Zhang
%A Peng Gao
%A Yali Wang
%A Yu Qiao
%A Ping Luo
%A Kaipeng Zhang
%A Wenqi Shao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-ying24a
%I PMLR
%P 57116--57198
%U https://proceedings.mlr.press/v235/ying24a.html
%V 235
%X Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks testing rudimentary capabilities, and thus fall short of tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, and reasoning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $20$ publicly available LVLMs, including the proprietary GeminiProVision model, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
APA
Ying, K., Meng, F., Wang, J., Li, Z., Lin, H., Yang, Y., Zhang, H., Zhang, W., Lin, Y., Liu, S., Lei, J., Lu, Q., Chen, R., Xu, P., Zhang, R., Zhang, H., Gao, P., Wang, Y., Qiao, Y., Luo, P., Zhang, K., & Shao, W. (2024). MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:57116-57198. Available from https://proceedings.mlr.press/v235/ying24a.html.