iWISDM: Assessing instruction following in multimodal models at scale

Xiaoxuan Lei, Lucas Gomez, Hao Yuan Bai, Pouya Bashivan
Proceedings of The 3rd Conference on Lifelong Learning Agents, PMLR 274:457-480, 2025.

Abstract

The ability to perform complex tasks from detailed instructions is key to the remarkable achievements of our species. As humans, we are capable not only of performing a wide variety of tasks but also of completing very complex ones that may entail hundreds or thousands of steps. Large language models and their more recent multimodal counterparts, which integrate textual and visual inputs, have achieved unprecedented success in performing complex tasks. Yet most existing benchmarks are largely confined to single-modality inputs, either text or vision, and thus narrow the scope of multimodal integration assessments, particularly for instruction following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment, engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap in these models’ ability to precisely follow instructions.

Cite this Paper


BibTeX
@InProceedings{pmlr-v274-lei25a,
  title     = {iWISDM: Assessing instruction following in multimodal models at scale},
  author    = {Lei, Xiaoxuan and Gomez, Lucas and Bai, Hao Yuan and Bashivan, Pouya},
  booktitle = {Proceedings of The 3rd Conference on Lifelong Learning Agents},
  pages     = {457--480},
  year      = {2025},
  editor    = {Lomonaco, Vincenzo and Melacci, Stefano and Tuytelaars, Tinne and Chandar, Sarath and Pascanu, Razvan},
  volume    = {274},
  series    = {Proceedings of Machine Learning Research},
  month     = {29 Jul--01 Aug},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v274/main/assets/lei25a/lei25a.pdf},
  url       = {https://proceedings.mlr.press/v274/lei25a.html}
}
Endnote
%0 Conference Paper
%T iWISDM: Assessing instruction following in multimodal models at scale
%A Xiaoxuan Lei
%A Lucas Gomez
%A Hao Yuan Bai
%A Pouya Bashivan
%B Proceedings of The 3rd Conference on Lifelong Learning Agents
%C Proceedings of Machine Learning Research
%D 2025
%E Vincenzo Lomonaco
%E Stefano Melacci
%E Tinne Tuytelaars
%E Sarath Chandar
%E Razvan Pascanu
%F pmlr-v274-lei25a
%I PMLR
%P 457--480
%U https://proceedings.mlr.press/v274/lei25a.html
%V 274
APA
Lei, X., Gomez, L., Bai, H.Y. & Bashivan, P. (2025). iWISDM: Assessing instruction following in multimodal models at scale. Proceedings of The 3rd Conference on Lifelong Learning Agents, in Proceedings of Machine Learning Research 274:457-480. Available from https://proceedings.mlr.press/v274/lei25a.html.
