MMA: Benchmarking Multi-Modal Large Language Models in Ambiguity Contexts

Ru Wang, Selena Song, Yuquan Wang, Liang Ding, Mingming Gong, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Conference on Parsimony and Learning, PMLR 328:529-551, 2026.

Abstract

While visual information in multimodal settings can naturally help resolve inherent ambiguities in natural language, the ability of multimodal large language models (MLLMs) to leverage visual cues for disambiguation remains underexplored. In this paper, we introduce a benchmark specifically designed to evaluate the performance of MLLMs in Ambiguous contexts (MMA). MMA uses a multiple-choice visual question-answering format with a novel evaluation protocol in which each ambiguous text is paired with two distinct images that suggest different scenarios. This setup requires models to provide different correct answers depending on the visual context, effectively testing their ability to perform cross-modal disambiguation. By evaluating 25 proprietary and open-source MLLMs, we find that: (1) MLLMs often overlook scenario-specific information provided by images to clarify the ambiguity of texts. When presented with two different contextual images and asked the same question, MLLMs achieve an accuracy of only 53.22% in answering both correctly, compared to human performance of 88.97%. (2) Among the three types of ambiguity, models perform best under lexical ambiguity and worst under syntactic ambiguity. (3) Proprietary models (e.g., Gemini 2.0 Pro, the top performer at 78.9%) outperform open-source counterparts by an average margin of 16.78%. These findings underscore the current limitations of MLLMs in integrating visual information to clarify textual ambiguities and highlight critical areas for future improvement. The code and benchmark data are available at https://github.com/physicsru/mma
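To make the paired evaluation protocol concrete, the sketch below scores a model as correct on an item only if it selects the gold option for both contextual images, which is the quantity the 53.22% figure refers to. The item schema and the `answer_fn` interface are illustrative assumptions for this sketch, not the benchmark's actual API; see the repository for the real implementation.

```python
from typing import Callable, Sequence


def paired_accuracy(
    items: Sequence[dict],
    answer_fn: Callable[[str, str], str],
) -> float:
    """Fraction of items answered correctly for BOTH contextual images.

    Assumed item layout (hypothetical, for illustration only):
        {"question": str,
         "image_a": str, "answer_a": str,   # first scenario
         "image_b": str, "answer_b": str}   # second scenario
    `answer_fn(image_path, question)` returns the model's chosen option label.
    """
    both_correct = 0
    for item in items:
        # The same ambiguous question is asked against each of the two images.
        pred_a = answer_fn(item["image_a"], item["question"])
        pred_b = answer_fn(item["image_b"], item["question"])
        # Credit is given only when both scenario-specific answers are right.
        if pred_a == item["answer_a"] and pred_b == item["answer_b"]:
            both_correct += 1
    return both_correct / len(items) if items else 0.0
```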

Cite this Paper


BibTeX
@InProceedings{pmlr-v328-wang26a,
  title     = {MMA: Benchmarking Multi-Modal Large Language Models in Ambiguity Contexts},
  author    = {Wang, Ru and Song, Selena and Wang, Yuquan and Ding, Liang and Gong, Mingming and Iwasawa, Yusuke and Matsuo, Yutaka and Guo, Jiaxian},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {529--551},
  year      = {2026},
  editor    = {Burkholz, Rebekka and Liu, Shiwei and Ravishankar, Saiprasad and Redman, William and Huang, Wei and Su, Weijie and Zhu, Zhihui},
  volume    = {328},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--26 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v328/main/assets/wang26a/wang26a.pdf},
  url       = {https://proceedings.mlr.press/v328/wang26a.html},
  abstract  = {While visual information in multimodal settings can naturally help resolve inherent ambiguities in natural language, the ability of multimodal large language models (MLLMs) to leverage visual cues for disambiguation remains underexplored. In this paper, we introduce the benchmark specifically designed to evaluate the performance of MLLMs in Ambiguous contexts (MMA). MMA uses a multiple-choice visual question-answering format with a novel evaluation protocol in which each ambiguous text is paired with two distinct images that suggest different scenarios. This setup requires models to provide different correct answers based on the visual context, effectively testing their ability to perform cross-modal disambiguation. By evaluating 25 proprietary and open-sourced MLLMs, we find that: (1) MLLMs often overlook scenario-specific information provided by images to clarify the ambiguity of texts. When presented with two different contextual images and asked the same question, MLLMs achieved an accuracy rate of only 53.22% in answering both correctly, compared to human performance at 88.97%. (2) Among the three types of ambiguity, models perform best under lexical ambiguity and worst under syntactic ambiguity. (3) Proprietary models (e.g., Gemini 2.0 Pro, top performer at 78.9%) outperform open-source counterparts by an average margin of 16.78%. These findings firstly underscore the current limitations of MLLMs in integrating visual information to clarify textual ambiguities and highlight critical areas for future improvements. The codes and benchmark data are https://github.com/physicsru/mma}
}
Endnote
%0 Conference Paper
%T MMA: Benchmarking Multi-Modal Large Language Models in Ambiguity Contexts
%A Ru Wang
%A Selena Song
%A Yuquan Wang
%A Liang Ding
%A Mingming Gong
%A Yusuke Iwasawa
%A Yutaka Matsuo
%A Jiaxian Guo
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Rebekka Burkholz
%E Shiwei Liu
%E Saiprasad Ravishankar
%E William Redman
%E Wei Huang
%E Weijie Su
%E Zhihui Zhu
%F pmlr-v328-wang26a
%I PMLR
%P 529--551
%U https://proceedings.mlr.press/v328/wang26a.html
%V 328
%X While visual information in multimodal settings can naturally help resolve inherent ambiguities in natural language, the ability of multimodal large language models (MLLMs) to leverage visual cues for disambiguation remains underexplored. In this paper, we introduce the benchmark specifically designed to evaluate the performance of MLLMs in Ambiguous contexts (MMA). MMA uses a multiple-choice visual question-answering format with a novel evaluation protocol in which each ambiguous text is paired with two distinct images that suggest different scenarios. This setup requires models to provide different correct answers based on the visual context, effectively testing their ability to perform cross-modal disambiguation. By evaluating 25 proprietary and open-sourced MLLMs, we find that: (1) MLLMs often overlook scenario-specific information provided by images to clarify the ambiguity of texts. When presented with two different contextual images and asked the same question, MLLMs achieved an accuracy rate of only 53.22% in answering both correctly, compared to human performance at 88.97%. (2) Among the three types of ambiguity, models perform best under lexical ambiguity and worst under syntactic ambiguity. (3) Proprietary models (e.g., Gemini 2.0 Pro, top performer at 78.9%) outperform open-source counterparts by an average margin of 16.78%. These findings firstly underscore the current limitations of MLLMs in integrating visual information to clarify textual ambiguities and highlight critical areas for future improvements. The codes and benchmark data are https://github.com/physicsru/mma
APA
Wang, R., Song, S., Wang, Y., Ding, L., Gong, M., Iwasawa, Y., Matsuo, Y. & Guo, J. (2026). MMA: Benchmarking Multi-Modal Large Language Models in Ambiguity Contexts. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 328:529-551. Available from https://proceedings.mlr.press/v328/wang26a.html.
