Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

Vivek Myers, Andre Wang He, Kuan Fang, Homer Rich Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine
Proceedings of The 7th Conference on Robot Learning, PMLR 229:3894-3908, 2023.

Abstract

Our goal is for robots to follow natural language instructions like “put the towel next to the microwave.” But getting large amounts of labeled data, i.e. data that contains demonstrations of tasks labeled with the language instruction, is prohibitive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this using vision-language models or by jointly training language-goal-conditioned policies, but so far neither method has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, but the aligned embedding provides an *interface* for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
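The core idea in the abstract can be sketched as a contrastive alignment between a language embedding and the embedded *change* between start and goal images. The following is a minimal illustrative sketch only, not the paper's implementation: the linear encoders, the InfoNCE-style loss, and the temperature value are all assumptions standing in for the learned networks and training details described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(z):
    """Project embeddings onto the unit sphere."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def alignment_loss(lang, start, goal, W_lang, W_img, temperature=0.1):
    """InfoNCE-style loss that aligns each instruction embedding with the
    embedding of the desired change (goal minus start), rather than the
    goal image alone. Matching pairs sit on the diagonal of the logits."""
    z_lang = l2_normalize(lang @ W_lang)
    z_delta = l2_normalize((goal - start) @ W_img)   # embed the *change*
    logits = z_lang @ z_delta.T / temperature        # (B, B) similarities
    # cross-entropy with the diagonal as the correct match
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(lang))
    return -log_probs[idx, idx].mean()

# Toy batch: 4 hypothetical (instruction, start image, goal image) triples
# represented as random 16-dim feature vectors, embedded into 8 dims.
B, D, E = 4, 16, 8
lang = rng.normal(size=(B, D))
start = rng.normal(size=(B, D))
goal = rng.normal(size=(B, D))
W_lang = rng.normal(size=(D, E))
W_img = rng.normal(size=(D, E))

loss = alignment_loss(lang, start, goal, W_lang, W_img)
print(float(loss) >= 0.0)
```

In this sketch, a goal-conditioned policy trained on unlabeled data could then consume `z_delta`-space embeddings, so that at test time an instruction's `z_lang` embedding steers the policy through the shared space.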

Cite this Paper


BibTeX
@InProceedings{pmlr-v229-myers23a,
  title     = {Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control},
  author    = {Myers, Vivek and He, Andre Wang and Fang, Kuan and Walke, Homer Rich and Hansen-Estruch, Philippe and Cheng, Ching-An and Jalobeanu, Mihai and Kolobov, Andrey and Dragan, Anca and Levine, Sergey},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {3894--3908},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/myers23a/myers23a.pdf},
  url       = {https://proceedings.mlr.press/v229/myers23a.html},
  abstract  = {Our goal is for robots to follow natural language instructions like ``put the towel next to the microwave.'' But getting large amounts of labeled data, i.e. data that contains demonstrations of tasks labeled with the language instruction, is prohibitive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this using vision-language models or by jointly training language-goal-conditioned policies, but so far neither method has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, but the aligned embedding provides an \emph{interface} for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.}
}
Endnote
%0 Conference Paper
%T Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control
%A Vivek Myers
%A Andre Wang He
%A Kuan Fang
%A Homer Rich Walke
%A Philippe Hansen-Estruch
%A Ching-An Cheng
%A Mihai Jalobeanu
%A Andrey Kolobov
%A Anca Dragan
%A Sergey Levine
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-myers23a
%I PMLR
%P 3894--3908
%U https://proceedings.mlr.press/v229/myers23a.html
%V 229
%X Our goal is for robots to follow natural language instructions like “put the towel next to the microwave.” But getting large amounts of labeled data, i.e. data that contains demonstrations of tasks labeled with the language instruction, is prohibitive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this using vision-language models or by jointly training language-goal-conditioned policies, but so far neither method has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, but the aligned embedding provides an *interface* for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
APA
Myers, V., He, A.W., Fang, K., Walke, H.R., Hansen-Estruch, P., Cheng, C., Jalobeanu, M., Kolobov, A., Dragan, A. & Levine, S. (2023). Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:3894-3908. Available from https://proceedings.mlr.press/v229/myers23a.html.
