ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation

Enyu Zhao; Vedant Raval; Hejia Zhang; Jiageng Mao; Zeyu Shangguan; Stefanos Nikolaidis; Yue Wang; Daniel Seita

Proceedings of The 9th Conference on Robot Learning, PMLR 305:3413-3462, 2025.

Abstract

Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 35 common and state-of-the-art VLM families on our benchmark, including variants to test different model sizes. The performance of VLMs significantly varies across tasks, and there is a strong correlation between this performance and trends in our real-world manipulation tasks. Our evaluation also shows that there remains a significant gap between these models and human-level understanding.

Cite this Paper

BibTeX

@InProceedings{pmlr-v305-zhao25a,
  title     = {ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation},
  author    = {Zhao, Enyu and Raval, Vedant and Zhang, Hejia and Mao, Jiageng and Shangguan, Zeyu and Nikolaidis, Stefanos and Wang, Yue and Seita, Daniel},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {3413--3462},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/zhao25a/zhao25a.pdf},
  url       = {https://proceedings.mlr.press/v305/zhao25a.html},
  abstract  = {Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 35 common and state-of-the-art VLM families on our benchmark, including variants to test different model sizes. The performance of VLMs significantly varies across tasks, and there is a strong correlation between this performance and trends in our real-world manipulation tasks. Our evaluation also shows that there remains a significant gap between these models and human-level understanding.}
}

Endnote

%0 Conference Paper
%T ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation
%A Enyu Zhao
%A Vedant Raval
%A Hejia Zhang
%A Jiageng Mao
%A Zeyu Shangguan
%A Stefanos Nikolaidis
%A Yue Wang
%A Daniel Seita
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-zhao25a
%I PMLR
%P 3413--3462
%U https://proceedings.mlr.press/v305/zhao25a.html
%V 305
%X Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 35 common and state-of-the-art VLM families on our benchmark, including variants to test different model sizes. The performance of VLMs significantly varies across tasks, and there is a strong correlation between this performance and trends in our real-world manipulation tasks. Our evaluation also shows that there remains a significant gap between these models and human-level understanding.

APA

Zhao, E., Raval, V., Zhang, H., Mao, J., Shangguan, Z., Nikolaidis, S., Wang, Y. & Seita, D. (2025). ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3413-3462. Available from https://proceedings.mlr.press/v305/zhao25a.html.
