VIP: Vision Instructed Pre-training for Robotic Manipulation

Zhuoling Li, Liangliang Ren, Jinrong Yang, Yong Zhao, Xiaoyang Wu, Zhenhua Xu, Xiang Bai, Hengshuang Zhao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:35769-35778, 2025.

Abstract

The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge is that manipulation tasks are diverse, and a trained policy becomes confused when task targets are not specified clearly. Existing works rely primarily on text instructions to describe targets. However, we reveal that current robotic data cannot train policies to understand text instructions effectively, and that vision is much more comprehensible. We therefore introduce vision instructions to specify targets. A straightforward implementation is to train a policy to predict the intermediate actions linking the current observation to a future image. Nevertheless, a single future image describes the task target in insufficient detail. To address this problem, we propose using sparse point flows to provide more detailed information. Extensive tasks are designed in real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate that VIP significantly improves performance on diverse tasks, and that the derived policy can complete challenging tasks such as “opening the lid of a tightly sealed bottle”.
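The objective described in the abstract is concrete enough to sketch: pre-train a policy that, given the current observation, a future image serving as the vision instruction, and sparse point flows that describe the target in more detail, regresses the intermediate actions. The PyTorch sketch below is purely illustrative; the architecture, tensor shapes, and L1 action loss are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class VisionInstructedPolicy(nn.Module):
    """Illustrative sketch: predict intermediate actions from the current
    observation, a future (goal) image, and sparse point flows.
    Every design choice here is an assumption, not the paper's code."""
    def __init__(self, action_dim=7, horizon=16, flow_points=32, d_model=256):
        super().__init__()
        # Shared CNN encoder for the current and goal images (assumed design).
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Sparse point flows: (x, y) tracks of K points over the horizon.
        self.flow_encoder = nn.Linear(flow_points * horizon * 2, d_model)
        # Fuse the three conditioning signals and decode an action chunk.
        self.head = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, obs, goal, flows):
        z = torch.cat([
            self.img_encoder(obs),                 # current observation
            self.img_encoder(goal),                # future image (vision instruction)
            self.flow_encoder(flows.flatten(1)),   # sparse point flows
        ], dim=-1)
        return self.head(z).view(-1, self.horizon, self.action_dim)

# Minimal pre-training step on a dummy batch (all shapes are assumptions).
policy = VisionInstructedPolicy()
obs     = torch.randn(8, 3, 96, 96)   # current RGB observation
goal    = torch.randn(8, 3, 96, 96)   # sampled future frame
flows   = torch.randn(8, 32, 16, 2)   # K=32 points tracked over 16 steps
actions = torch.randn(8, 16, 7)       # intermediate actions linking obs to goal
loss = nn.functional.l1_loss(policy(obs, goal, flows), actions)
loss.backward()
```

In this reading, the future image specifies *what* the scene should look like, while the point flows add *how* the scene gets there; the policy is supervised only on the intermediate actions, so any frame from a demonstration can serve as a goal during pre-training.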

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-li25by,
  title     = {{VIP}: Vision Instructed Pre-training for Robotic Manipulation},
  author    = {Li, Zhuoling and Ren, Liangliang and Yang, Jinrong and Zhao, Yong and Wu, Xiaoyang and Xu, Zhenhua and Bai, Xiang and Zhao, Hengshuang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {35769--35778},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25by/li25by.pdf},
  url       = {https://proceedings.mlr.press/v267/li25by.html}
}
Endnote
%0 Conference Paper
%T VIP: Vision Instructed Pre-training for Robotic Manipulation
%A Zhuoling Li
%A Liangliang Ren
%A Jinrong Yang
%A Yong Zhao
%A Xiaoyang Wu
%A Zhenhua Xu
%A Xiang Bai
%A Hengshuang Zhao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25by
%I PMLR
%P 35769--35778
%U https://proceedings.mlr.press/v267/li25by.html
%V 267
APA
Li, Z., Ren, L., Yang, J., Zhao, Y., Wu, X., Xu, Z., Bai, X. & Zhao, H. (2025). VIP: Vision Instructed Pre-training for Robotic Manipulation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:35769-35778. Available from https://proceedings.mlr.press/v267/li25by.html.
