A3VLM: Actionable Articulation-Aware Vision Language Model
Proceedings of The 8th Conference on Robot Learning, PMLR 270:1675-1690, 2025.
Abstract
Vision Language Models (VLMs) for robotics have received significant attention in recent years. Because a VLM can understand robot observations and perform complex visual reasoning, it is regarded as a potential universal solution for general robotics challenges such as manipulation and navigation. However, previous robotics VLMs such as RT-1, RT-2, and ManipLLM have focused on directly learning robot actions. Such approaches require collecting large amounts of robot interaction data, which is extremely costly in the real world. We therefore propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM.
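To make the idea of a robot-agnostic, articulation-aware representation concrete, the following is a minimal illustrative sketch of how an object part's articulation description might be mapped to simple action primitives. The class name, fields, and primitive-selection rule are assumptions for illustration only, not A3VLM's actual interface.

```python
# Hypothetical sketch: an object-centric, articulation-aware description of a
# movable part, translated into a robot-agnostic action primitive.
# All names and fields are illustrative assumptions, not A3VLM's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class ArticulationPart:
    """One actionable part of an object, as a VLM might describe it."""
    bbox_corners: np.ndarray    # (8, 3) 3D bounding box of the movable part
    axis_origin: np.ndarray     # (3,) a point on the articulation axis
    axis_direction: np.ndarray  # (3,) unit direction of the articulation axis
    joint_type: str             # "revolute" (e.g., door) or "prismatic" (e.g., drawer)


def to_action_primitive(part: ArticulationPart, delta: float) -> dict:
    """Translate the articulation description into a simple primitive:
    rotate about the axis for revolute joints, slide along it for prismatic
    joints. A robot-specific controller would consume this output."""
    grasp_point = part.bbox_corners.mean(axis=0)  # crude grasp: box center
    if part.joint_type == "revolute":
        return {
            "primitive": "rotate",
            "grasp_point": grasp_point,
            "axis_origin": part.axis_origin,
            "axis_direction": part.axis_direction,
            "angle_rad": delta,
        }
    if part.joint_type == "prismatic":
        return {
            "primitive": "slide",
            "grasp_point": grasp_point,
            "direction": part.axis_direction,
            "distance_m": delta,
        }
    raise ValueError(f"unknown joint type: {part.joint_type}")


# Example: a drawer front described as a prismatic part, pulled out 15 cm.
drawer = ArticulationPart(
    bbox_corners=np.array(
        [[x, y, z] for x in (0.0, 0.4) for y in (0.0, 0.05) for z in (0.0, 0.2)]
    ),
    axis_origin=np.array([0.2, 0.0, 0.1]),
    axis_direction=np.array([0.0, -1.0, 0.0]),
    joint_type="prismatic",
)
print(to_action_primitive(drawer, delta=0.15))
```

Because the primitive carries only geometric quantities (a grasp point, an axis, and a displacement), the same description could in principle drive different robot embodiments, which is the sense in which such a representation is robot-agnostic.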