KITE: Keypoint-Conditioned Policies for Semantic Manipulation

Priya Sundaresan; Suneel Belkhale; Dorsa Sadigh; Jeannette Bohg

KITE: Keypoint-Conditioned Policies for Semantic Manipulation

Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, Jeannette Bohg

Proceedings of The 7th Conference on Robot Learning, PMLR 229:1006-1021, 2023.

Abstract

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation – where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution, a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves a $75%$, $70%$, and $71%$ overall success rate for instruction-following, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: https://tinyurl.com/kite-site.

Cite this Paper

BibTeX


@InProceedings{pmlr-v229-sundaresan23a,
  title = 	 {KITE: Keypoint-Conditioned Policies for Semantic Manipulation},
  author =       {Sundaresan, Priya and Belkhale, Suneel and Sadigh, Dorsa and Bohg, Jeannette},
  booktitle = 	 {Proceedings of The 7th Conference on Robot Learning},
  pages = 	 {1006--1021},
  year = 	 {2023},
  editor = 	 {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume = 	 {229},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--09 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v229/sundaresan23a/sundaresan23a.pdf},
  url = 	 {https://proceedings.mlr.press/v229/sundaresan23a.html},
  abstract = 	 {While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation – where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution, a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves a $75%$, $70%$, and $71%$ overall success rate for instruction-following, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: https://tinyurl.com/kite-site.}
}

Endnote

%0 Conference Paper
%T KITE: Keypoint-Conditioned Policies for Semantic Manipulation
%A Priya Sundaresan
%A Suneel Belkhale
%A Dorsa Sadigh
%A Jeannette Bohg
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish	
%F pmlr-v229-sundaresan23a
%I PMLR
%P 1006--1021
%U https://proceedings.mlr.press/v229/sundaresan23a.html
%V 229
%X While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation – where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution, a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves a $75%$, $70%$, and $71%$ overall success rate for instruction-following, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: https://tinyurl.com/kite-site.

APA


Sundaresan, P., Belkhale, S., Sadigh, D. & Bohg, J.. (2023). KITE: Keypoint-Conditioned Policies for Semantic Manipulation. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:1006-1021 Available from https://proceedings.mlr.press/v229/sundaresan23a.html.

KITE: Keypoint-Conditioned Policies for Semantic Manipulation

Abstract

Cite this Paper

Related Material