RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

Yao Mu, Junting Chen, Qing-Long Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:36434-36454, 2024.

Abstract

Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one embodied navigation task.
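As a rough illustration of the decomposition the abstract describes, the sketch below shows how a tree of object-centric manipulation units carrying affordance and safety preferences could be rendered into robot code. All class, method, and constraint names here are hypothetical placeholders for exposition; they are not the authors' actual data structures or API.

# Illustrative sketch only: hypothetical structures loosely mirroring the
# tree-structured decomposition described in the abstract.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ManipulationUnit:
    """One object-centric step with physical preferences attached."""
    target_object: str
    action: str                      # e.g. "grasp", "place", "open"
    affordance: str                  # preferred contact region, e.g. "handle"
    safety_constraints: List[str] = field(default_factory=list)
    children: List["ManipulationUnit"] = field(default_factory=list)

def to_robot_code(unit: ManipulationUnit, indent: int = 0) -> str:
    """Walk the tree and emit platform-agnostic pseudo-API calls."""
    pad = "    " * indent
    lines = [
        f"{pad}# {unit.action} {unit.target_object} via {unit.affordance}",
        f"{pad}robot.{unit.action}('{unit.target_object}', "
        f"affordance='{unit.affordance}', "
        f"constraints={unit.safety_constraints})",
    ]
    for child in unit.children:
        lines.append(to_robot_code(child, indent))
    return "\n".join(lines)

# Example: "put the mug in the microwave" decomposed into two units.
plan = ManipulationUnit(
    target_object="microwave", action="open", affordance="door_handle",
    safety_constraints=["avoid_collision_with_counter"],
    children=[
        ManipulationUnit(
            target_object="mug", action="grasp", affordance="mug_handle",
            safety_constraints=["gentle_grip"],
        )
    ],
)
print(to_robot_code(plan))

In this reading, each node of the tree is a self-contained unit whose preferences (which part of the object to contact, which motions to avoid) travel with it, so the same decomposition can be re-emitted as code for different robot platforms.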

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-mu24a,
  title     = {{R}obo{C}ode{X}: Multimodal Code Generation for Robotic Behavior Synthesis},
  author    = {Mu, Yao and Chen, Junting and Zhang, Qing-Long and Chen, Shoufa and Yu, Qiaojun and Ge, Chongjian and Chen, Runjian and Liang, Zhixuan and Hu, Mengkang and Tao, Chaofan and Sun, Peize and Yu, Haibao and Yang, Chao and Shao, Wenqi and Wang, Wenhai and Dai, Jifeng and Qiao, Yu and Ding, Mingyu and Luo, Ping},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {36434--36454},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/mu24a/mu24a.pdf},
  url       = {https://proceedings.mlr.press/v235/mu24a.html}
}
APA
Mu, Y., Chen, J., Zhang, Q., Chen, S., Yu, Q., Ge, C., Chen, R., Liang, Z., Hu, M., Tao, C., Sun, P., Yu, H., Yang, C., Shao, W., Wang, W., Dai, J., Qiao, Y., Ding, M. & Luo, P. (2024). RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:36434-36454. Available from https://proceedings.mlr.press/v235/mu24a.html.
