Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?

Yuchen Cui, Scott Niekum, Abhinav Gupta, Vikash Kumar, Aravind Rajeswaran
Proceedings of The 4th Annual Learning for Dynamics and Control Conference, PMLR 168:893-905, 2022.

Abstract

Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for the engagement of non-expert end users and the ultimate adoption of personalized robot agents. A widely studied approach to task specification is through goals, using either compact state space vectors or goal images from the same robot scene. The former is often not easily human interpretable and necessitates detailed state estimation and scene understanding. The latter requires the generation of a desired goal image, which often requires a human to complete the task, defeating the purpose of having autonomous robots. In this work, we explore alternate and more general forms of goal specification that are expected to be easier for humans to specify and use, such as images obtained from the internet, hand sketches that provide a visual description of the desired task, or simple language descriptions. As a first step towards this, we study the capabilities of large-scale pre-trained models (foundation models) for zero-shot goal specification, and find that they are surprisingly effective in a collection of simulated robot manipulation tasks and real-world datasets.

Cite this Paper


BibTeX
@InProceedings{pmlr-v168-cui22a,
  title     = {Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?},
  author    = {Cui, Yuchen and Niekum, Scott and Gupta, Abhinav and Kumar, Vikash and Rajeswaran, Aravind},
  booktitle = {Proceedings of The 4th Annual Learning for Dynamics and Control Conference},
  pages     = {893--905},
  year      = {2022},
  editor    = {Firoozi, Roya and Mehr, Negar and Yel, Esen and Antonova, Rika and Bohg, Jeannette and Schwager, Mac and Kochenderfer, Mykel},
  volume    = {168},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--24 Jun},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v168/cui22a/cui22a.pdf},
  url       = {https://proceedings.mlr.press/v168/cui22a.html},
  abstract  = {Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for engagement of non-expert end users and ultimate adoption of personalized robot agents. A widely studied approach to task specification is through goals, using either compact state space vectors or goal images from the same robot scene. The former is often not easily human interpretable and necessitates detailed state estimation and scene understanding. The latter requires the generation of desired goal image, which often requires a human to complete the task, defeating the purpose of having autonomous robots. In this work, we explore alternate and more general forms of goal specification that are expected to be easier for humans to specify and use such as images obtained from the internet, hand sketches that provide a visual description of the desired task, or simple language descriptions. As a first step towards this, we study the capabilities of large scale pre-trained models (foundation models) for zero-shot goal specification, and find that they are surprisingly effective in a collection of simulated robot manipulation tasks and real-world datasets.}
}
Endnote
%0 Conference Paper
%T Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?
%A Yuchen Cui
%A Scott Niekum
%A Abhinav Gupta
%A Vikash Kumar
%A Aravind Rajeswaran
%B Proceedings of The 4th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2022
%E Roya Firoozi
%E Negar Mehr
%E Esen Yel
%E Rika Antonova
%E Jeannette Bohg
%E Mac Schwager
%E Mykel Kochenderfer
%F pmlr-v168-cui22a
%I PMLR
%P 893--905
%U https://proceedings.mlr.press/v168/cui22a.html
%V 168
%X Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for engagement of non-expert end users and ultimate adoption of personalized robot agents. A widely studied approach to task specification is through goals, using either compact state space vectors or goal images from the same robot scene. The former is often not easily human interpretable and necessitates detailed state estimation and scene understanding. The latter requires the generation of desired goal image, which often requires a human to complete the task, defeating the purpose of having autonomous robots. In this work, we explore alternate and more general forms of goal specification that are expected to be easier for humans to specify and use such as images obtained from the internet, hand sketches that provide a visual description of the desired task, or simple language descriptions. As a first step towards this, we study the capabilities of large scale pre-trained models (foundation models) for zero-shot goal specification, and find that they are surprisingly effective in a collection of simulated robot manipulation tasks and real-world datasets.
APA
Cui, Y., Niekum, S., Gupta, A., Kumar, V., & Rajeswaran, A. (2022). Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? Proceedings of The 4th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research, 168:893-905. Available from https://proceedings.mlr.press/v168/cui22a.html.