Distilling Internet-Scale Vision-Language Models into Embodied Agents

Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:32797-32818, 2023.

Abstract

Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent’s behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Few-shot prompting lets us teach abstract category membership, including pre-existing categories (food vs toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
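
The core mechanism described in the abstract (querying a pretrained VLM to retroactively caption the agent's own behavior, then training on those captions as hindsight instructions) can be summarized in a rough Python sketch. The sketch below is illustrative only: the vlm_caption and bc_update helpers and the simplified trajectory format are hypothetical placeholders standing in for the paper's VLM queries and imitation-learning update, not the authors' implementation.

    # Illustrative sketch of VLM-based hindsight relabeling; not the authors' code.
    # vlm_caption, bc_update, and Trajectory are hypothetical stand-ins.
    from dataclasses import dataclass
    from typing import Any, Callable, List, Tuple

    @dataclass
    class Trajectory:
        observations: List[Any]  # per-step visual observations (e.g., rendered frames)
        actions: List[Any]       # per-step agent actions

    def relabel_with_vlm(
        trajectories: List[Trajectory],
        vlm_caption: Callable[[Any, str], str],
        prompt: str,
    ) -> List[Tuple[str, Trajectory]]:
        """Retroactively describe each trajectory with a pretrained VLM.

        vlm_caption(frame, prompt) stands in for querying a vision-language
        model with the trajectory's final frame plus a task-specific prompt,
        e.g. "What object is the agent holding?" or a few-shot category prompt.
        """
        relabeled = []
        for traj in trajectories:
            final_frame = traj.observations[-1]
            instruction = vlm_caption(final_frame, prompt)  # e.g. "Lift the plane"
            relabeled.append((instruction, traj))
        return relabeled

    def distill(
        relabeled: List[Tuple[str, Trajectory]],
        bc_update: Callable[[str, Any, Any], None],
    ) -> None:
        """Behavioral cloning on the hindsight-relabeled data: the agent learns to
        reproduce its own actions when conditioned on the VLM's description."""
        for instruction, traj in relabeled:
            for obs, act in zip(traj.observations, traj.actions):
                bc_update(instruction, obs, act)

Under this scheme, changing the prompt passed to the VLM changes what the agent is taught: a naming prompt yields object-name instructions, while a few-shot prompt can relabel the same trajectories with category-level language (e.g., food vs toys).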

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-sumers23a,
  title     = {Distilling Internet-Scale Vision-Language Models into Embodied Agents},
  author    = {Sumers, Theodore and Marino, Kenneth and Ahuja, Arun and Fergus, Rob and Dasgupta, Ishita},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {32797--32818},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/sumers23a/sumers23a.pdf},
  url       = {https://proceedings.mlr.press/v202/sumers23a.html},
  abstract  = {Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent’s behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Fewshot prompting lets us teach abstract category membership, including pre-existing categories (food vs toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.}
}
Endnote
%0 Conference Paper
%T Distilling Internet-Scale Vision-Language Models into Embodied Agents
%A Theodore Sumers
%A Kenneth Marino
%A Arun Ahuja
%A Rob Fergus
%A Ishita Dasgupta
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-sumers23a
%I PMLR
%P 32797--32818
%U https://proceedings.mlr.press/v202/sumers23a.html
%V 202
%X Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent’s behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Fewshot prompting lets us teach abstract category membership, including pre-existing categories (food vs toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
APA
Sumers, T., Marino, K., Ahuja, A., Fergus, R., & Dasgupta, I. (2023). Distilling Internet-Scale Vision-Language Models into Embodied Agents. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:32797-32818. Available from https://proceedings.mlr.press/v202/sumers23a.html.