DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video

Priyanka Mandikal; Kristen Grauman

Proceedings of the 5th Conference on Robot Learning, PMLR 164:651-661, 2022.

Abstract

Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent’s hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab—a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment to obtain human demonstrations, while also being faster to train.

Cite this Paper

BibTeX


@InProceedings{pmlr-v164-mandikal22a,
  title = {DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video},
  author = {Mandikal, Priyanka and Grauman, Kristen},
  booktitle = {Proceedings of the 5th Conference on Robot Learning},
  pages = {651--661},
  year = {2022},
  editor = {Faust, Aleksandra and Hsu, David and Neumann, Gerhard},
  volume = {164},
  series = {Proceedings of Machine Learning Research},
  month = {08--11 Nov},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v164/mandikal22a/mandikal22a.pdf},
  url = {https://proceedings.mlr.press/v164/mandikal22a.html},
  abstract = {Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent’s hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab—a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment to obtain human demonstrations, while also being faster to train.}
}

Endnote

%0 Conference Paper
%T DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video
%A Priyanka Mandikal
%A Kristen Grauman
%B Proceedings of the 5th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Aleksandra Faust
%E David Hsu
%E Gerhard Neumann
%F pmlr-v164-mandikal22a
%I PMLR
%P 651--661
%U https://proceedings.mlr.press/v164/mandikal22a.html
%V 164
%X Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent’s hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab—a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment to obtain human demonstrations, while also being faster to train.

APA


Mandikal, P. & Grauman, K. (2022). DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video. Proceedings of the 5th Conference on Robot Learning, in Proceedings of Machine Learning Research 164:651-661. Available from https://proceedings.mlr.press/v164/mandikal22a.html.
