Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation

Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, Katerina Fragkiadaki
Proceedings of The 7th Conference on Robot Learning, PMLR 229:3949-3965, 2023.

Abstract

3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot’s workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state-of-the-art in RLBench, an established manipulation benchmark, where it achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments.
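
To make the coarse-to-fine procedure described above concrete, here is a minimal, hypothetical Python/PyTorch sketch: candidate 3D points are featurized by attending to 2D backbone features lifted to the point cloud via depth, scored, and the next, finer round of sampling is focused around the best candidate. The class and variable names (CoarseToFineSampler, scene_feat) and the use of plain multi-head cross-attention are illustrative assumptions, not the authors' implementation; Act3D itself uses relative-position attention and weight tying across the coarse-to-fine rounds.

# Minimal, hypothetical sketch (not the authors' code) of coarse-to-fine
# 3D point sampling. Standard multi-head cross-attention stands in for the
# paper's relative-position attention.
import torch
import torch.nn as nn

class CoarseToFineSampler(nn.Module):
    def __init__(self, feat_dim=64, num_points=512):
        super().__init__()
        self.num_points = num_points
        self.point_embed = nn.Linear(3, feat_dim)   # embed candidate 3D coordinates
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)         # per-point action score

    def forward(self, scene_feat, center, radius, num_rounds=3):
        # scene_feat: (N, C) 2D backbone features lifted to the point cloud via depth
        for _ in range(num_rounds):
            # Sample a random grid of candidate 3D points around the current focus.
            offsets = (torch.rand(self.num_points, 3) * 2 - 1) * radius
            candidates = center + offsets                     # (P, 3)
            queries = self.point_embed(candidates)[None]      # (1, P, C)
            keys = scene_feat[None]                           # (1, N, C)
            feats, _ = self.attn(queries, keys, keys)         # featurize candidates
            scores = self.score(feats)[0, :, 0]               # (P,)
            # Focus the next, finer round of sampling around the best candidate.
            center = candidates[scores.argmax()]
            radius = radius * 0.5
        return center, scores.softmax(-1)

# Usage with stand-in data: 2048 lifted scene points, 64-dim features.
sampler = CoarseToFineSampler()
scene_feat = torch.rand(2048, 64)
best_point, point_probs = sampler(scene_feat, center=torch.zeros(3), radius=1.0)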

Cite this Paper


BibTeX
@InProceedings{pmlr-v229-gervet23a,
  title     = {Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation},
  author    = {Gervet, Theophile and Xian, Zhou and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {3949--3965},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/gervet23a/gervet23a.pdf},
  url       = {https://proceedings.mlr.press/v229/gervet23a.html},
  abstract  = {3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state-of-the-art in RLBench, an established manipulation benchmark, where it achieves a 10\% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22\% absolute improvement with 3x less compute over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments.}
}
Endnote
%0 Conference Paper
%T Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation
%A Theophile Gervet
%A Zhou Xian
%A Nikolaos Gkanatsios
%A Katerina Fragkiadaki
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-gervet23a
%I PMLR
%P 3949--3965
%U https://proceedings.mlr.press/v229/gervet23a.html
%V 229
%X 3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot’s workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state-of-the-art in RLBench, an established manipulation benchmark, where it achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments.
APA
Gervet, T., Xian, Z., Gkanatsios, N., & Fragkiadaki, K. (2023). Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:3949-3965. Available from https://proceedings.mlr.press/v229/gervet23a.html.