Guiding Multi-Step Rearrangement Tasks with Natural Language Instructions

Elias Stengel-Eskin, Andrew Hundt, Zhuohong He, Aditya Murali, Nakul Gopalan, Matthew Gombolay, Gregory Hager
Proceedings of the 5th Conference on Robot Learning, PMLR 164:1486-1501, 2022.

Abstract

Enabling human operators to interact with robotic agents using natural language would allow non-experts to intuitively instruct these agents. Towards this goal, we propose a novel Transformer-based model which enables a user to guide a robot arm through a 3D multi-step manipulation task with natural language commands. Our system maps images and commands to masks over grasp or place locations, grounding the language directly in perceptual space. In a suite of block rearrangement tasks, we show that these masks can be combined with an existing manipulation framework without re-training, greatly improving learning efficiency. Our masking model is several orders of magnitude more sample efficient than typical Transformer models, operating with hundreds, not millions, of examples. Our modular design allows us to leverage supervised and reinforcement learning, providing an easy interface for experimentation with different architectures. Our model completes block manipulation tasks with synthetic commands 530% more often than a UNet-based baseline, and learns to localize actions correctly while creating a mapping of symbols to perceptual input that supports compositional reasoning. We provide a valuable resource for 3D manipulation instruction following research by porting an existing 3D block dataset with crowdsourced language to a simulated environment. Our method's 25.3% absolute improvement in identifying the correct block on the ported dataset demonstrates its ability to handle syntactic and lexical variation.

Cite this Paper


BibTeX
@InProceedings{pmlr-v164-stengel-eskin22a,
  title     = {Guiding Multi-Step Rearrangement Tasks with Natural Language Instructions},
  author    = {Stengel-Eskin, Elias and Hundt, Andrew and He, Zhuohong and Murali, Aditya and Gopalan, Nakul and Gombolay, Matthew and Hager, Gregory},
  booktitle = {Proceedings of the 5th Conference on Robot Learning},
  pages     = {1486--1501},
  year      = {2022},
  editor    = {Faust, Aleksandra and Hsu, David and Neumann, Gerhard},
  volume    = {164},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--11 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v164/stengel-eskin22a/stengel-eskin22a.pdf},
  url       = {https://proceedings.mlr.press/v164/stengel-eskin22a.html},
  abstract  = {Enabling human operators to interact with robotic agents using natural language would allow non-experts to intuitively instruct these agents. Towards this goal, we propose a novel Transformer-based model which enables a user to guide a robot arm through a 3D multi-step manipulation task with natural language commands. Our system maps images and commands to masks over grasp or place locations, grounding the language directly in perceptual space. In a suite of block rearrangement tasks, we show that these masks can be combined with an existing manipulation framework without re-training, greatly improving learning efficiency. Our masking model is several orders of magnitude more sample efficient than typical Transformer models, operating with hundreds, not millions, of examples. Our modular design allows us to leverage supervised and reinforcement learning, providing an easy interface for experimentation with different architectures. Our model completes block manipulation tasks with synthetic commands 530\% more often than a UNet-based baseline, and learns to localize actions correctly while creating a mapping of symbols to perceptual input that supports compositional reasoning. We provide a valuable resource for 3D manipulation instruction following research by porting an existing 3D block dataset with crowdsourced language to a simulated environment. Our method's 25.3\% absolute improvement in identifying the correct block on the ported dataset demonstrates its ability to handle syntactic and lexical variation.}
}
Endnote
%0 Conference Paper
%T Guiding Multi-Step Rearrangement Tasks with Natural Language Instructions
%A Elias Stengel-Eskin
%A Andrew Hundt
%A Zhuohong He
%A Aditya Murali
%A Nakul Gopalan
%A Matthew Gombolay
%A Gregory Hager
%B Proceedings of the 5th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Aleksandra Faust
%E David Hsu
%E Gerhard Neumann
%F pmlr-v164-stengel-eskin22a
%I PMLR
%P 1486--1501
%U https://proceedings.mlr.press/v164/stengel-eskin22a.html
%V 164
%X Enabling human operators to interact with robotic agents using natural language would allow non-experts to intuitively instruct these agents. Towards this goal, we propose a novel Transformer-based model which enables a user to guide a robot arm through a 3D multi-step manipulation task with natural language commands. Our system maps images and commands to masks over grasp or place locations, grounding the language directly in perceptual space. In a suite of block rearrangement tasks, we show that these masks can be combined with an existing manipulation framework without re-training, greatly improving learning efficiency. Our masking model is several orders of magnitude more sample efficient than typical Transformer models, operating with hundreds, not millions, of examples. Our modular design allows us to leverage supervised and reinforcement learning, providing an easy interface for experimentation with different architectures. Our model completes block manipulation tasks with synthetic commands 530% more often than a UNet-based baseline, and learns to localize actions correctly while creating a mapping of symbols to perceptual input that supports compositional reasoning. We provide a valuable resource for 3D manipulation instruction following research by porting an existing 3D block dataset with crowdsourced language to a simulated environment. Our method's 25.3% absolute improvement in identifying the correct block on the ported dataset demonstrates its ability to handle syntactic and lexical variation.
APA
Stengel-Eskin, E., Hundt, A., He, Z., Murali, A., Gopalan, N., Gombolay, M., & Hager, G. (2022). Guiding Multi-Step Rearrangement Tasks with Natural Language Instructions. Proceedings of the 5th Conference on Robot Learning, in Proceedings of Machine Learning Research 164:1486-1501. Available from https://proceedings.mlr.press/v164/stengel-eskin22a.html.
