DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video

Priyanka Mandikal, Kristen Grauman
Proceedings of the 5th Conference on Robot Learning, PMLR 164:651-661, 2022.

Abstract

Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent’s hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab—a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment to obtain human demonstrations, while also being faster to train.

Cite this Paper


BibTeX
@InProceedings{pmlr-v164-mandikal22a,
  title     = {DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video},
  author    = {Mandikal, Priyanka and Grauman, Kristen},
  booktitle = {Proceedings of the 5th Conference on Robot Learning},
  pages     = {651--661},
  year      = {2022},
  editor    = {Faust, Aleksandra and Hsu, David and Neumann, Gerhard},
  volume    = {164},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--11 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v164/mandikal22a/mandikal22a.pdf},
  url       = {https://proceedings.mlr.press/v164/mandikal22a.html},
  abstract  = {Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent's hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab—a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment to obtain human demonstrations, while also being faster to train.}
}
Endnote
%0 Conference Paper
%T DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video
%A Priyanka Mandikal
%A Kristen Grauman
%B Proceedings of the 5th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Aleksandra Faust
%E David Hsu
%E Gerhard Neumann
%F pmlr-v164-mandikal22a
%I PMLR
%P 651--661
%U https://proceedings.mlr.press/v164/mandikal22a.html
%V 164
%X Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent's hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab—a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment to obtain human demonstrations, while also being faster to train.
APA
Mandikal, P. & Grauman, K. (2022). DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video. Proceedings of the 5th Conference on Robot Learning, in Proceedings of Machine Learning Research 164:651-661. Available from https://proceedings.mlr.press/v164/mandikal22a.html.