Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Zahra Babaiee, Peyman Kiasari, Daniela Rus, Radu Grosu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:2081-2113, 2025.

Abstract

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists: ‘conceptualization’—the ability to recognize and reason about the same concept despite variations in visual form, a fundamental capacity of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems’ capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed entirely on isomorphism detection and showed only limited success on path and cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/.
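The core idea behind the isomorphism task can be illustrated with a small, hypothetical sketch (this is not the authors' dataset-generation code): two drawings of a graph may look entirely different, yet the underlying edge structure can be identical up to a relabeling of nodes. A brute-force check makes the concept concrete for tiny graphs.

```python
from itertools import permutations

def is_isomorphic(edges_a, edges_b, n):
    """Brute-force isomorphism test for two n-node graphs given as edge lists.

    Tries every node relabeling of graph A and checks whether its edge set
    matches graph B's. Only feasible for very small n; shown purely to
    illustrate the concept the VGA isomorphism task probes.
    """
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    if len(ea) != len(eb):
        return False
    for perm in permutations(range(n)):
        if {frozenset((perm[u], perm[v])) for u, v in ea} == eb:
            return True
    return False

# A 5-cycle and a relabeled copy: same structure, so isomorphic.
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
c5_relabeled = [(2, 4), (4, 1), (1, 3), (3, 0), (0, 2)]
print(is_isomorphic(c5, c5_relabeled, 5))  # True

# A 5-cycle vs. a 5-node path: same node count, different structure.
p5 = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(is_isomorphic(c5, p5, 5))  # False
```

A human solving VGA's isomorphism task performs something like this comparison visually, regardless of whether the graphs are drawn with a Kamada-Kawai or a planar layout; the paper's finding is that current vision models and multimodal LLMs fail to do so.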

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-babaiee25a,
  title     = {Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models},
  author    = {Babaiee, Zahra and Kiasari, Peyman and Rus, Daniela and Grosu, Radu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {2081--2113},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/babaiee25a/babaiee25a.pdf},
  url       = {https://proceedings.mlr.press/v267/babaiee25a.html},
  abstract  = {Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists, ‘conceptualization’—the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems’ capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/.}
}
Endnote
%0 Conference Paper
%T Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
%A Zahra Babaiee
%A Peyman Kiasari
%A Daniela Rus
%A Radu Grosu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-babaiee25a
%I PMLR
%P 2081--2113
%U https://proceedings.mlr.press/v267/babaiee25a.html
%V 267
%X Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists, ‘conceptualization’—the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems’ capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/.
APA
Babaiee, Z., Kiasari, P., Rus, D. & Grosu, R. (2025). Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:2081-2113. Available from https://proceedings.mlr.press/v267/babaiee25a.html.