M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox
Proceedings of The 7th Conference on Robot Learning, PMLR 229:3619-3630, 2023.

Abstract

With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulties generalizing to out-of-distribution objects due to the limited capability of their low-level action primitives. In contrast, existing task-specific models excel in low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model which reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on a real robot, outperforming a baseline system built from state-of-the-art task-specific models by about 19% in overall performance and 37.5% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language-conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both the real world and simulation are available at m2-t2.github.io.
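To make the interface described above concrete, the following is a minimal, hypothetical Python sketch of how a model of this kind might be invoked: a scene point cloud goes in, and candidate 6-DoF gripper poses (4x4 transforms) with per-point contact scores come out for each action mode. The class and method names (M2T2Stub, predict, the "pick"/"place" mode strings) are illustrative assumptions, not the authors' released API, and the stub only mimics the shape of the outputs, not the learned policy.

import numpy as np

class M2T2Stub:
    """Placeholder standing in for the trained transformer; it reproduces
    the output structure (gripper poses and per-point contact scores) only."""
    def predict(self, points: np.ndarray, mode: str):
        assert mode in ("pick", "place")          # hypothetical action modes
        num_candidates = 4
        poses = np.tile(np.eye(4), (num_candidates, 1, 1))  # identity placeholders for 4x4 gripper poses
        contact_scores = np.random.rand(points.shape[0])     # per-point contact confidence
        return poses, contact_scores

if __name__ == "__main__":
    scene_points = np.random.rand(2048, 3)        # raw scene point cloud (N x 3)
    model = M2T2Stub()
    grasp_poses, grasp_scores = model.predict(scene_points, mode="pick")
    place_poses, place_scores = model.predict(scene_points, mode="place")
    print(grasp_poses.shape, grasp_scores.shape)  # (4, 4, 4) (2048,)

In the actual system, the highest-scoring pose for the requested mode would be handed to a motion planner for execution; the stub above only illustrates the input/output contract implied by the abstract.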

Cite this Paper


BibTeX
@InProceedings{pmlr-v229-yuan23a,
  title     = {M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place},
  author    = {Yuan, Wentao and Murali, Adithyavairavan and Mousavian, Arsalan and Fox, Dieter},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {3619--3630},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/yuan23a/yuan23a.pdf},
  url       = {https://proceedings.mlr.press/v229/yuan23a.html},
  abstract  = {With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulties generalizing to out-of-distribution objects due to the limited capability of their low-level action primitives. In contrast, existing task-specific models excel in low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model which reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on a real robot, outperforming a baseline system built from state-of-the-art task-specific models by about 19\% in overall performance and 37.5\% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language-conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both the real world and simulation are available at m2-t2.github.io.}
}
Endnote
%0 Conference Paper
%T M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place
%A Wentao Yuan
%A Adithyavairavan Murali
%A Arsalan Mousavian
%A Dieter Fox
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-yuan23a
%I PMLR
%P 3619--3630
%U https://proceedings.mlr.press/v229/yuan23a.html
%V 229
%X With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulties generalizing to out-of-distribution objects due to the limited capability of their low-level action primitives. In contrast, existing task-specific models excel in low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model which reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on a real robot, outperforming a baseline system built from state-of-the-art task-specific models by about 19% in overall performance and 37.5% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language-conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both the real world and simulation are available at m2-t2.github.io.
APA
Yuan, W., Murali, A., Mousavian, A. & Fox, D. (2023). M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:3619-3630. Available from https://proceedings.mlr.press/v229/yuan23a.html.
