R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair; Aravind Rajeswaran; Vikash Kumar; Chelsea Finn; Abhinav Gupta

Proceedings of The 6th Conference on Robot Learning, PMLR 205:892-909, 2023.

Abstract

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.
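
As a rough illustration of how a time-contrastive term and an L1 sparsity penalty might be combined into one objective, here is a minimal sketch in plain Python. The function names, the negative-L2 similarity, the two-way softmax form, and the `l1_weight` value are all assumptions made for illustration and are not taken from the abstract. The video-language alignment term is left out entirely.

```python
import math


def neg_l2(a, b):
    """Similarity as the negative Euclidean distance between two embeddings."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def time_contrastive_loss(anchor, near, far):
    """Two-way contrastive loss (assumed form).

    The frame that is close in time (near) should be more similar to the
    anchor than the frame that is distant in time (far).
    """
    s_pos = neg_l2(anchor, near)
    s_neg = neg_l2(anchor, far)
    # -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))) == log(1 + exp(s_neg - s_pos))
    return math.log1p(math.exp(s_neg - s_pos))


def l1_penalty(z):
    """Mean absolute activation; minimizing it encourages sparse embeddings."""
    return sum(abs(x) for x in z) / len(z)


def combined_loss(anchor, near, far, l1_weight=0.1):
    """Contrastive term plus weighted L1 term (l1_weight is hypothetical)."""
    return time_contrastive_loss(anchor, near, far) + l1_weight * l1_penalty(anchor)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    anchor, near, far = [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]
    print(combined_loss(anchor, near, far))
```

In this sketch the loss is lower when the temporally near frame is embedded closer to the anchor than the far frame, and the L1 term grows with the magnitude of the anchor embedding.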

Cite this Paper

BibTeX


@InProceedings{pmlr-v205-nair23a,
  title     = {R3M: A Universal Visual Representation for Robot Manipulation},
  author    = {Nair, Suraj and Rajeswaran, Aravind and Kumar, Vikash and Finn, Chelsea and Gupta, Abhinav},
  booktitle = {Proceedings of The 6th Conference on Robot Learning},
  pages     = {892--909},
  year      = {2023},
  editor    = {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume    = {205},
  series    = {Proceedings of Machine Learning Research},
  month     = {14--18 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v205/nair23a/nair23a.pdf},
  url       = {https://proceedings.mlr.press/v205/nair23a.html},
  abstract  = {We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20\% compared to training from scratch and by over 10\% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.}
}

Endnote

%0 Conference Paper
%T R3M: A Universal Visual Representation for Robot Manipulation
%A Suraj Nair
%A Aravind Rajeswaran
%A Vikash Kumar
%A Chelsea Finn
%A Abhinav Gupta
%B Proceedings of The 6th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Karen Liu
%E Dana Kulic
%E Jeff Ichnowski
%F pmlr-v205-nair23a
%I PMLR
%P 892--909
%U https://proceedings.mlr.press/v205/nair23a.html
%V 205
%X We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.

APA


Nair, S., Rajeswaran, A., Kumar, V., Finn, C. & Gupta, A. (2023). R3M: A Universal Visual Representation for Robot Manipulation. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:892-909. Available from https://proceedings.mlr.press/v205/nair23a.html.
