A3VLM: Actionable Articulation-Aware Vision Language Model

Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Abdeslam Boularias, Peng Gao, Hongsheng Li
Proceedings of The 8th Conference on Robot Learning, PMLR 270:1675-1690, 2025.

Abstract

Vision Language Models (VLMs) for robotics have received significant attention in recent years. As a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. Thus, we propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.
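
To illustrate the idea stated in the abstract that a robot-agnostic, articulation-aware representation can be turned into robot actions through simple action primitives, the following Python sketch maps a hypothetical articulation hypothesis (joint type, joint axis, and an actionable grasp point) to a short sequence of end-effector waypoints. All names and data structures here are illustrative assumptions for exposition; they are not the paper's actual representation or interface.

import numpy as np
from dataclasses import dataclass

@dataclass
class ArticulationHypothesis:
    # Hypothetical, illustrative representation (not the paper's interface).
    joint_type: str            # "revolute" (e.g., a door) or "prismatic" (e.g., a drawer)
    axis_origin: np.ndarray    # a point on the joint axis, in the observation frame
    axis_direction: np.ndarray # direction of the joint axis
    grasp_point: np.ndarray    # actionable point on the movable part

def primitive_waypoints(art: ArticulationHypothesis, magnitude: float, steps: int = 10):
    """Translate an articulation hypothesis into gripper waypoints.

    A prismatic joint slides the grasp point along the axis; a revolute joint
    rotates it about the axis (Rodrigues' formula). `magnitude` is in metres
    for prismatic joints and radians for revolute joints.
    """
    d = art.axis_direction / np.linalg.norm(art.axis_direction)
    waypoints = []
    for s in np.linspace(0.0, magnitude, steps):
        if art.joint_type == "prismatic":
            waypoints.append(art.grasp_point + s * d)
        elif art.joint_type == "revolute":
            r = art.grasp_point - art.axis_origin
            # Rotate r about d by angle s using Rodrigues' rotation formula.
            r_rot = (r * np.cos(s)
                     + np.cross(d, r) * np.sin(s)
                     + d * np.dot(d, r) * (1.0 - np.cos(s)))
            waypoints.append(art.axis_origin + r_rot)
        else:
            raise ValueError(f"unknown joint type: {art.joint_type}")
    return waypoints

# Example: open a drawer 20 cm by sliding along its (assumed) prismatic axis.
drawer = ArticulationHypothesis(
    joint_type="prismatic",
    axis_origin=np.array([0.5, 0.0, 0.4]),
    axis_direction=np.array([1.0, 0.0, 0.0]),
    grasp_point=np.array([0.5, 0.0, 0.45]),
)
trajectory = primitive_waypoints(drawer, magnitude=0.20)

Because the waypoints are expressed purely in the observation frame, any robot with its own inverse-kinematics stack could consume them, which is the sense in which such a representation is robot-agnostic.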

Cite this Paper

BibTeX
@InProceedings{pmlr-v270-huang25b,
  title     = {A3VLM: Actionable Articulation-Aware Vision Language Model},
  author    = {Huang, Siyuan and Chang, Haonan and Liu, Yuhan and Zhu, Yimeng and Dong, Hao and Boularias, Abdeslam and Gao, Peng and Li, Hongsheng},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {1675--1690},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/huang25b/huang25b.pdf},
  url       = {https://proceedings.mlr.press/v270/huang25b.html},
  abstract  = {Vision Language Models (VLMs) for robotics have received significant attention in recent years. As a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. Thus, we propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.}
}
Endnote
%0 Conference Paper
%T A3VLM: Actionable Articulation-Aware Vision Language Model
%A Siyuan Huang
%A Haonan Chang
%A Yuhan Liu
%A Yimeng Zhu
%A Hao Dong
%A Abdeslam Boularias
%A Peng Gao
%A Hongsheng Li
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-huang25b
%I PMLR
%P 1675--1690
%U https://proceedings.mlr.press/v270/huang25b.html
%V 270
%X Vision Language Models (VLMs) for robotics have received significant attention in recent years. As a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. Thus, we propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.
APA
Huang, S., Chang, H., Liu, Y., Zhu, Y., Dong, H., Boularias, A., Gao, P. & Li, H. (2025). A3VLM: Actionable Articulation-Aware Vision Language Model. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:1675-1690. Available from https://proceedings.mlr.press/v270/huang25b.html.