On Path to Multimodal Generalist: General-Level and General-Bench

Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Weiming Wu, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:16423-16542, 2025.

Abstract

The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of language-based LLMs. Unlike their specialist predecessors, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting singular modalities to accommodating a wide array of or even arbitrary modalities. To assess the capabilities of various MLLMs, a diverse array of benchmark test sets has been proposed. This leads to a critical question: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. In this project, we introduce an evaluation framework to delineate the capabilities and behaviors of current multimodal generalists. This framework, named General-Level, establishes 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI (Artificial General Intelligence). Central to our framework is the use of Synergy as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. 
The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project Page: https://generalist.top/, Leaderboard: https://generalist.top/leaderboard/, Benchmark: https://huggingface.co/General-Level/.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-fei25a,
  title = {On Path to Multimodal Generalist: General-Level and General-Bench},
  author = {Fei, Hao and Zhou, Yuan and Li, Juncheng and Li, Xiangtai and Xu, Qingshan and Li, Bobo and Wu, Shengqiong and Wang, Yaoting and Zhou, Junbao and Meng, Jiahao and Shi, Qingyu and Zhou, Zhiyuan and Shi, Liangtao and Gao, Minghe and Zhang, Daoan and Ge, Zhiqi and Tang, Siliang and Pan, Kaihang and Ye, Yaobo and Yuan, Haobo and Zhang, Tao and Wu, Weiming and Ju, Tianjie and Meng, Zixiang and Xu, Shilin and Jia, Liyu and Hu, Wentao and Luo, Meng and Luo, Jiebo and Chua, Tat-Seng and Yan, Shuicheng and Zhang, Hanwang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {16423--16542},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/fei25a/fei25a.pdf},
  url = {https://proceedings.mlr.press/v267/fei25a.html},
  abstract = {The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of language-based LLMs. Unlike their specialist predecessors, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting singular modalities to accommodating a wide array of or even arbitrary modalities. To assess the capabilities of various MLLMs, a diverse array of benchmark test sets has been proposed. This leads to a critical question: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. In this project, we introduce an evaluation framework to delineate the capabilities and behaviors of current multimodal generalists. This framework, named General-Level, establishes 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI (Artificial General Intelligence). Central to our framework is the use of Synergy as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project Page: https://generalist.top/, Leaderboard: https://generalist.top/leaderboard/, Benchmark: https://huggingface.co/General-Level/.}
}
Endnote
%0 Conference Paper
%T On Path to Multimodal Generalist: General-Level and General-Bench
%A Hao Fei
%A Yuan Zhou
%A Juncheng Li
%A Xiangtai Li
%A Qingshan Xu
%A Bobo Li
%A Shengqiong Wu
%A Yaoting Wang
%A Junbao Zhou
%A Jiahao Meng
%A Qingyu Shi
%A Zhiyuan Zhou
%A Liangtao Shi
%A Minghe Gao
%A Daoan Zhang
%A Zhiqi Ge
%A Siliang Tang
%A Kaihang Pan
%A Yaobo Ye
%A Haobo Yuan
%A Tao Zhang
%A Weiming Wu
%A Tianjie Ju
%A Zixiang Meng
%A Shilin Xu
%A Liyu Jia
%A Wentao Hu
%A Meng Luo
%A Jiebo Luo
%A Tat-Seng Chua
%A Shuicheng Yan
%A Hanwang Zhang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-fei25a
%I PMLR
%P 16423--16542
%U https://proceedings.mlr.press/v267/fei25a.html
%V 267
%X The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of language-based LLMs. Unlike their specialist predecessors, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting singular modalities to accommodating a wide array of or even arbitrary modalities. To assess the capabilities of various MLLMs, a diverse array of benchmark test sets has been proposed. This leads to a critical question: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. In this project, we introduce an evaluation framework to delineate the capabilities and behaviors of current multimodal generalists. This framework, named General-Level, establishes 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI (Artificial General Intelligence). Central to our framework is the use of Synergy as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project Page: https://generalist.top/, Leaderboard: https://generalist.top/leaderboard/, Benchmark: https://huggingface.co/General-Level/.
APA
Fei, H., Zhou, Y., Li, J., Li, X., Xu, Q., Li, B., Wu, S., Wang, Y., Zhou, J., Meng, J., Shi, Q., Zhou, Z., Shi, L., Gao, M., Zhang, D., Ge, Z., Tang, S., Pan, K., Ye, Y., Yuan, H., Zhang, T., Wu, W., Ju, T., Meng, Z., Xu, S., Jia, L., Hu, W., Luo, M., Luo, J., Chua, T.-S., Yan, S., & Zhang, H. (2025). On Path to Multimodal Generalist: General-Level and General-Bench. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:16423-16542. Available from https://proceedings.mlr.press/v267/fei25a.html.
