FindThis: Language-Driven Object Disambiguation in Indoor Environments

Arjun Majumdar, Fei Xia, Brian Ichter, Dhruv Batra, Leonidas Guibas
Proceedings of The 7th Conference on Robot Learning, PMLR 229:1335-1347, 2023.

Abstract

Natural language is naturally ambiguous. In this work, we consider interactions between a user and a mobile service robot tasked with locating a desired object, specified by a language utterance. We present FindThis, a task that addresses the problem of disambiguating and locating the particular object instance desired through a dialog with the user. To approach this problem we propose an algorithm, GoFind, which exploits visual attributes of the object that may be intrinsic (e.g., color, shape) or extrinsic (e.g., location, relationships to other entities), expressed in an open vocabulary. GoFind leverages the visual common sense learned by large language models to enable fine-grained object localization and attribute differentiation in a zero-shot manner. We also provide a new visio-linguistic dataset, 3D Objects in Context (3DOC), consisting of Google Scanned Objects placed in Habitat-Matterport 3D scenes, for evaluating agents on this task. Finally, we validate our approach on a real robot operating in an unstructured physical office environment using complex fine-grained language instructions.
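To make the idea concrete, the following is a minimal sketch of the kind of attribute-based disambiguation pipeline the abstract describes: an LLM decomposes the user's utterance into open-vocabulary attribute phrases, and a vision-language model (here CLIP, as one plausible choice) scores candidate object views against those phrases. This is an illustrative assumption, not the authors' GoFind implementation; all function names and prompts below are hypothetical.

# Hypothetical sketch of attribute-based object disambiguation in the
# spirit of GoFind as summarized in the abstract; not the authors' code.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_candidates(candidate_crop_paths, attribute_phrases):
    """Rank candidate object crops by their mean similarity to
    open-vocabulary attribute phrases, e.g. intrinsic ('a red mug')
    or extrinsic ('a mug next to the laptop')."""
    images = torch.stack(
        [preprocess(Image.open(p)) for p in candidate_crop_paths]
    ).to(device)
    texts = clip.tokenize(attribute_phrases).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(texts)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    # Average similarity over all attribute phrases for each candidate crop.
    return (img_feats @ txt_feats.T).mean(dim=-1)

# Upstream, an LLM would first decompose the user's utterance into such
# phrases, e.g. "the mug I left by the window" ->
# ["a mug", "a mug by a window"]; the highest-scoring candidate is the
# disambiguated target, and a tie can trigger a clarifying dialog turn.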

Cite this Paper


BibTeX
@InProceedings{pmlr-v229-majumdar23a,
  title     = {FindThis: Language-Driven Object Disambiguation in Indoor Environments},
  author    = {Majumdar, Arjun and Xia, Fei and Ichter, Brian and Batra, Dhruv and Guibas, Leonidas},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {1335--1347},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/majumdar23a/majumdar23a.pdf},
  url       = {https://proceedings.mlr.press/v229/majumdar23a.html}
}
APA
Majumdar, A., Xia, F., Ichter, B., Batra, D. & Guibas, L. (2023). FindThis: Language-Driven Object Disambiguation in Indoor Environments. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:1335-1347. Available from https://proceedings.mlr.press/v229/majumdar23a.html.