Do Vision-Language Models Really Understand Visual Language?

Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23910-23959, 2025.

Abstract

Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of six LVLMs shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hou25c, title = {Do Vision-Language Models Really Understand Visual Language?}, author = {Hou, Yifan and Giledereli, Buse and Tu, Yilei and Sachan, Mrinmaya}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {23910--23959}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hou25c/hou25c.pdf}, url = {https://proceedings.mlr.press/v267/hou25c.html}, abstract = {Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of six LVLMs shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.} }
Endnote
%0 Conference Paper %T Do Vision-Language Models Really Understand Visual Language? %A Yifan Hou %A Buse Giledereli %A Yilei Tu %A Mrinmaya Sachan %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-hou25c %I PMLR %P 23910--23959 %U https://proceedings.mlr.press/v267/hou25c.html %V 267 %X Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of six LVLMs shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
APA
Hou, Y., Giledereli, B., Tu, Y. & Sachan, M.. (2025). Do Vision-Language Models Really Understand Visual Language?. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:23910-23959 Available from https://proceedings.mlr.press/v267/hou25c.html.

Related Material