Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

Mohit Shridhar; Lucas Manuelli; Dieter Fox

Proceedings of The 6th Conference on Robot Learning, PMLR 205:785-799, 2023.

Abstract

Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by “detecting the next best voxel action”. Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.

Cite this Paper

BibTeX


@InProceedings{pmlr-v205-shridhar23a,
  title = 	 {Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation},
  author =       {Shridhar, Mohit and Manuelli, Lucas and Fox, Dieter},
  booktitle = 	 {Proceedings of The 6th Conference on Robot Learning},
  pages = 	 {785--799},
  year = 	 {2023},
  editor = 	 {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume = 	 {205},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--18 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v205/shridhar23a/shridhar23a.pdf},
  url = 	 {https://proceedings.mlr.press/v205/shridhar23a.html},
  abstract = 	 {Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by “detecting the next best voxel action”. Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.}
}

Endnote

%0 Conference Paper
%T Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
%A Mohit Shridhar
%A Lucas Manuelli
%A Dieter Fox
%B Proceedings of The 6th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Karen Liu
%E Dana Kulic
%E Jeff Ichnowski
%F pmlr-v205-shridhar23a
%I PMLR
%P 785--799
%U https://proceedings.mlr.press/v205/shridhar23a.html
%V 205
%X Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by “detecting the next best voxel action”. Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.

APA


Shridhar, M., Manuelli, L. & Fox, D. (2023). Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:785-799. Available from https://proceedings.mlr.press/v205/shridhar23a.html.
