Zero-Shot Reward Specification via Grounded Natural Language

Parsa Mahmoudieh, Deepak Pathak, Trevor Darrell
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:14743-14752, 2022.

Abstract

Reward signals in reinforcement learning are expensive to design and often require access to the true state, which is not available in the real world. Common alternatives are demonstrations or goal images, which can be labor-intensive to collect. Text descriptions, by contrast, provide a general, natural, and low-effort way of communicating the desired task. However, prior work on learning text-conditioned policies still relies on rewards defined using either the true state or labeled expert demonstrations. We use recent developments in large-scale vision-language models such as CLIP to devise a framework that generates the task reward signal purely from a goal text description and raw pixel observations; this reward is then used to learn the task policy. We evaluate the proposed framework on control and robotic manipulation tasks. Finally, we distill the individual task policies into a single goal-text-conditioned policy that generalizes zero-shot to new tasks with unseen objects and unseen goal text descriptions.
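
The core reward mechanism described above can be sketched as follows: embed the goal text and the current camera frame with a pretrained CLIP model, and use their cosine similarity as the per-step reward. This is a minimal illustration of the idea, not the authors' released code; the model variant ("ViT-B/32") and the example goal string are assumptions.

# Minimal sketch of a CLIP-based reward, assuming OpenAI's CLIP package
# (pip install git+https://github.com/openai/CLIP.git). Not the paper's
# implementation; model choice and goal string are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_reward(frame: Image.Image, goal_text: str) -> float:
    """Reward = cosine similarity between the pixel observation and the goal text."""
    image = preprocess(frame).unsqueeze(0).to(device)   # [1, 3, 224, 224]
    text = clip.tokenize([goal_text]).to(device)        # [1, 77]
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    # L2-normalize so the dot product is a cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

# Hypothetical usage inside an RL loop (env.render() returning an RGB array):
#   r_t = clip_reward(Image.fromarray(env.render()), "robot pushing the red block")

In practice one would shape or normalize this raw similarity score before feeding it to the policy optimizer, since CLIP similarities occupy a narrow numeric range.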

Cite this Paper

BibTeX
@InProceedings{pmlr-v162-mahmoudieh22a,
  title     = {Zero-Shot Reward Specification via Grounded Natural Language},
  author    = {Mahmoudieh, Parsa and Pathak, Deepak and Darrell, Trevor},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {14743--14752},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/mahmoudieh22a/mahmoudieh22a.pdf},
  url       = {https://proceedings.mlr.press/v162/mahmoudieh22a.html}
}
Endnote
%0 Conference Paper
%T Zero-Shot Reward Specification via Grounded Natural Language
%A Parsa Mahmoudieh
%A Deepak Pathak
%A Trevor Darrell
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-mahmoudieh22a
%I PMLR
%P 14743--14752
%U https://proceedings.mlr.press/v162/mahmoudieh22a.html
%V 162
APA
Mahmoudieh, P., Pathak, D., & Darrell, T. (2022). Zero-Shot Reward Specification via Grounded Natural Language. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:14743-14752. Available from https://proceedings.mlr.press/v162/mahmoudieh22a.html.