Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation

Theophile Gervet; Zhou Xian; Nikolaos Gkanatsios; Katerina Fragkiadaki

Proceedings of The 7th Conference on Robot Learning, PMLR 229:3949-3965, 2023.

Abstract

3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot’s workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state-of-the-art in RLBench, an established manipulation benchmark, where it achieves 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments.
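The coarse-to-fine sampling loop mentioned in the abstract can be illustrated with a toy sketch. This is not the Act3D implementation: the function `coarse_to_fine_search`, its parameters (`points_per_axis`, `levels`, `shrink`), and the distance-based `toy_score` are all hypothetical stand-ins. In particular, `toy_score` replaces the learned relative-position attention that the paper uses to score sampled 3D points.

```python
import itertools
import math


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi, inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def coarse_to_fine_search(score_fn, center, half_extent,
                          points_per_axis=8, levels=3, shrink=0.25):
    """Toy coarse-to-fine search over a cubic 3D workspace.

    score_fn is a hypothetical stand-in for a learned per-point score.
    At each level a regular grid is sampled around the current center,
    the best-scoring point becomes the new center, and the grid shrinks.
    """
    center = tuple(float(c) for c in center)
    for _ in range(levels):
        # One axis of sample coordinates per spatial dimension.
        axes = [linspace(c - half_extent, c + half_extent, points_per_axis)
                for c in center]
        # Score every grid point and refocus on the best one.
        center = max(itertools.product(*axes), key=score_fn)
        half_extent *= shrink
    return center


# Assumed toy setup: the "correct" action location is a fixed 3D point.
target = (0.31, -0.12, 0.47)


def toy_score(point):
    """Higher is better: negative Euclidean distance to the target."""
    return -math.dist(point, target)


best = coarse_to_fine_search(toy_score, center=(0.0, 0.0, 0.0), half_extent=1.0)
```

With three levels of 8 points per axis, the sketch scores 3 × 512 points, whereas a single dense grid at the final spacing over the same workspace would need far more. This is the kind of saving that adaptive resolution is meant to provide.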

Cite this Paper

BibTeX


@InProceedings{pmlr-v229-gervet23a,
  title = 	 {Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation},
  author =       {Gervet, Theophile and Xian, Zhou and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
  booktitle = 	 {Proceedings of The 7th Conference on Robot Learning},
  pages = 	 {3949--3965},
  year = 	 {2023},
  editor = 	 {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume = 	 {229},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--09 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v229/gervet23a/gervet23a.pdf},
  url = 	 {https://proceedings.mlr.press/v229/gervet23a.html},
  abstract = 	 {3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot’s workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state-of-the-art in RLBench, an established manipulation benchmark, where it achieves $10\%$ absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and $22\%$ absolute improvement with 3x less compute over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments.}
}

Endnote

%0 Conference Paper
%T Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation
%A Theophile Gervet
%A Zhou Xian
%A Nikolaos Gkanatsios
%A Katerina Fragkiadaki
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-gervet23a
%I PMLR
%P 3949--3965
%U https://proceedings.mlr.press/v229/gervet23a.html
%V 229
%X 3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot’s workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state-of-the-art in RLBench, an established manipulation benchmark, where it achieves 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments.

APA


Gervet, T., Xian, Z., Gkanatsios, N. & Fragkiadaki, K. (2023). Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:3949-3965. Available from https://proceedings.mlr.press/v229/gervet23a.html.
