Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping
Proceedings of The 7th Conference on Robot Learning, PMLR 229:178-200, 2023.
Abstract
Grasping objects by a specific subpart is often crucial for safety and for executing downstream tasks. We propose LERF-TOGO, Language Embedded Radiance Fields for Task-Oriented Grasping of Objects, which uses vision-language models zero-shot to output a grasp distribution over an object given a natural language query. To accomplish this, we first construct a LERF of the scene, which distills CLIP embeddings into a multi-scale 3D language field queryable with text. However, LERF has no sense of object boundaries, so its relevancy outputs often return incomplete activations over an object, which are insufficient for grasping. LERF-TOGO mitigates this lack of spatial grouping by extracting a 3D object mask via DINO features and then conditionally querying LERF on this mask to obtain a semantic distribution over the object to rank grasps from an off-the-shelf grasp planner. We evaluate LERF-TOGO’s ability to grasp task-oriented object parts on 31 physical objects, and find it selects grasps on the correct part in 81% of trials and grasps successfully in 69%. Code, data, appendix, and details are available at: lerftogo.github.io
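
To give a concrete sense of the final ranking step described above, the sketch below scores candidate grasps by pooling part-level language relevancy over nearby points of the masked object. The function name `rank_grasps`, the neighborhood radius, and the mean-pooling rule are illustrative assumptions for this sketch, not the paper's exact implementation; only the overall pipeline (relevancy field, DINO-derived object mask, grasps from an external planner) comes from the abstract.

```python
import numpy as np

def rank_grasps(points, relevancy, object_mask, grasp_centers, radius=0.02):
    """Rank candidate grasps by the semantic relevancy of nearby object points.

    points:        (N, 3) scene point cloud
    relevancy:     (N,) per-point relevancy for the language query
                   (conditioned on the object, as in LERF-TOGO)
    object_mask:   (N,) boolean 3D object mask (e.g., from DINO grouping)
    grasp_centers: (G, 3) centers of candidate grasps from a grasp planner
    Returns grasp indices sorted from best to worst.
    """
    obj_pts = points[object_mask]
    obj_rel = relevancy[object_mask]
    scores = np.empty(len(grasp_centers))
    for i, center in enumerate(grasp_centers):
        near = np.linalg.norm(obj_pts - center, axis=1) < radius
        # Grasps with no object points nearby are ruled out entirely.
        scores[i] = obj_rel[near].mean() if near.any() else -np.inf
    return np.argsort(-scores)
```

Restricting the scoring to points inside the object mask is what keeps spurious relevancy activations elsewhere in the scene from attracting grasps, mirroring the conditional-query idea in the abstract.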