Real-World Robot Learning with Masked Visual Pre-training

Ilija Radosavovic; Tete Xiao; Stephen James; Pieter Abbeel; Jitendra Malik; Trevor Darrell

Proceedings of The 6th Conference on Robot Learning, PMLR 205:416-426, 2023.

Abstract

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.
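The pipeline the abstract describes (a pre-trained encoder that stays frozen, feeding a learnable control module) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the encoder is a small fixed random projection standing in for the MAE-pre-trained vision transformer, the control module is a linear head, and training is behavior cloning on synthetic demonstrations. The names `ENCODER_W`, `encode`, `ControlHead`, `make_demos`, `mean_loss`, and `train` are hypothetical.

```python
import math
import random

rng = random.Random(0)
IMG_DIM, FEAT_DIM, ACT_DIM = 6, 4, 2  # toy sizes, chosen arbitrarily

# Hypothetical stand-in for pre-trained encoder weights. Stored as nested
# tuples so that nothing in the training loop can modify them (frozen).
ENCODER_W = tuple(
    tuple(rng.gauss(0.0, 0.5) for _ in range(IMG_DIM)) for _ in range(FEAT_DIM)
)


def encode(image):
    """Frozen encoder: fixed linear projection followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, image))) for row in ENCODER_W]


class ControlHead:
    """Learnable linear control module mapping features to actions."""

    def __init__(self, feat_dim, act_dim):
        self.w = [[0.0] * feat_dim for _ in range(act_dim)]

    def act(self, feats):
        return [sum(w * f for w, f in zip(row, feats)) for row in self.w]

    def update(self, feats, target, lr):
        # One gradient step on the squared error between predicted and
        # demonstrated action; only the head's weights change.
        pred = self.act(feats)
        for i, row in enumerate(self.w):
            err = pred[i] - target[i]
            for j in range(len(row)):
                row[j] -= lr * err * feats[j]


def make_demos(n):
    """Synthetic demonstrations: a made-up expert that is linear in the features."""
    expert_w = [[rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(ACT_DIM)]
    demos = []
    for _ in range(n):
        image = [rng.random() for _ in range(IMG_DIM)]
        feats = encode(image)
        action = [sum(w * f for w, f in zip(row, feats)) for row in expert_w]
        demos.append((image, action))
    return demos


def mean_loss(head, demos):
    """Mean squared action error of the head over the demonstrations."""
    total = 0.0
    for image, action in demos:
        pred = head.act(encode(image))
        total += sum((p - a) ** 2 for p, a in zip(pred, action))
    return total / len(demos)


def train(head, demos, epochs=200, lr=0.2):
    """Behavior cloning: fit the head to demonstrated actions, encoder untouched."""
    for _ in range(epochs):
        for image, action in demos:
            head.update(encode(image), action, lr)
```

To try it, build a few demonstrations with `make_demos(8)`, create `ControlHead(FEAT_DIM, ACT_DIM)`, and compare `mean_loss` before and after calling `train`. The point of the design is that gradient updates touch only the head, while the visual features come from a fixed function.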

Cite this Paper

BibTeX


@InProceedings{pmlr-v205-radosavovic23a,
  title = 	 {Real-World Robot Learning with Masked Visual Pre-training},
  author =       {Radosavovic, Ilija and Xiao, Tete and James, Stephen and Abbeel, Pieter and Malik, Jitendra and Darrell, Trevor},
  booktitle = 	 {Proceedings of The 6th Conference on Robot Learning},
  pages = 	 {416--426},
  year = 	 {2023},
  editor = 	 {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume = 	 {205},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--18 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v205/radosavovic23a/radosavovic23a.pdf},
  url = 	 {https://proceedings.mlr.press/v205/radosavovic23a.html},
  abstract = 	 {In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75\%), supervised ImageNet pre-training (up to 81\%), and training from scratch (up to 81\%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.}
}

Endnote

%0 Conference Paper
%T Real-World Robot Learning with Masked Visual Pre-training
%A Ilija Radosavovic
%A Tete Xiao
%A Stephen James
%A Pieter Abbeel
%A Jitendra Malik
%A Trevor Darrell
%B Proceedings of The 6th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Karen Liu
%E Dana Kulic
%E Jeff Ichnowski
%F pmlr-v205-radosavovic23a
%I PMLR
%P 416--426
%U https://proceedings.mlr.press/v205/radosavovic23a.html
%V 205
%X In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.

APA


Radosavovic, I., Xiao, T., James, S., Abbeel, P., Malik, J., & Darrell, T. (2023). Real-World Robot Learning with Masked Visual Pre-training. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:416-426. Available from https://proceedings.mlr.press/v205/radosavovic23a.html.
