Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Jiachen Li; Qiaozi Gao; Michael Johnston; Xiaofeng Gao; Xuehai He; Hangjie Shi; Suhaila Shakiah; Reza Ghanadan; William Yang Wang

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Jiachen Li, Qiaozi Gao, Michael Johnston, Xiaofeng Gao, Xuehai He, Hangjie Shi, Suhaila Shakiah, Reza Ghanadan, William Yang Wang

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:27822-27845, 2024.

Abstract

Prompt-based learning has been demonstrated as a compelling paradigm contributing to large language models’ tremendous success (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text descriptions. This type of task poses a major challenge to robots’ capability to understand the interconnection and complementarity between vision and language signals. In this work, we introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our methods consist of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input and model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on the VIMA-BENCH and establish a new state-of-the-art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-li24x,
  title = 	 {Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning},
  author =       {Li, Jiachen and Gao, Qiaozi and Johnston, Michael and Gao, Xiaofeng and He, Xuehai and Shi, Hangjie and Shakiah, Suhaila and Ghanadan, Reza and Wang, William Yang},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {27822--27845},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/li24x/li24x.pdf},
  url = 	 {https://proceedings.mlr.press/v235/li24x.html},
  abstract = 	 {Prompt-based learning has been demonstrated as a compelling paradigm contributing to large language models’ tremendous success (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text descriptions. This type of task poses a major challenge to robots’ capability to understand the interconnection and complementarity between vision and language signals. In this work, we introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our methods consist of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input and model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on the VIMA-BENCH and establish a new state-of-the-art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability.}
}

Endnote

%0 Conference Paper
%T Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning
%A Jiachen Li
%A Qiaozi Gao
%A Michael Johnston
%A Xiaofeng Gao
%A Xuehai He
%A Hangjie Shi
%A Suhaila Shakiah
%A Reza Ghanadan
%A William Yang Wang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp	
%F pmlr-v235-li24x
%I PMLR
%P 27822--27845
%U https://proceedings.mlr.press/v235/li24x.html
%V 235
%X Prompt-based learning has been demonstrated as a compelling paradigm contributing to large language models’ tremendous success (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text descriptions. This type of task poses a major challenge to robots’ capability to understand the interconnection and complementarity between vision and language signals. In this work, we introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our methods consist of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input and model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on the VIMA-BENCH and establish a new state-of-the-art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability.

APA


Li, J., Gao, Q., Johnston, M., Gao, X., He, X., Shi, H., Shakiah, S., Ghanadan, R. & Wang, W.Y.. (2024). Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:27822-27845 Available from https://proceedings.mlr.press/v235/li24x.html.

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Abstract

Cite this Paper

Related Material