GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation

Abhay Deshpande, Yuquan Deng, Jordi Salvador, Arijit Ray, Winson Han, Jiafei Duan, Rose Hendrix, Yuke Zhu, Ranjay Krishna
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2983-3007, 2025.

Abstract

We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from a large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35% achieved by the next best alternative. GraspMolmo also successfully demonstrates the ability to predict semantically correct bimanual grasps zero-shot. We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation.
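A note on the pipeline the abstract implies: a model of this kind points at a grasp location in the RGB image, and the paired depth channel plus camera intrinsics are what lift that pixel to a metric 3D location from which a stable grasp pose can be selected. The sketch below illustrates only that generic back-projection and nearest-candidate step; the function names, intrinsics, and candidate grasps are hypothetical stand-ins and are not taken from the released GraspMolmo code.

# Hypothetical sketch: lifting a predicted 2D grasp point to 3D and selecting
# the nearest pre-computed grasp candidate. All names and camera values are
# illustrative only; they do not come from the GraspMolmo release.
import numpy as np

def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def select_nearest_grasp(point_3d, grasp_poses):
    """Pick the candidate grasp whose translation is closest to the 3D point.

    grasp_poses: (N, 4, 4) array of homogeneous gripper poses in the camera frame.
    """
    translations = grasp_poses[:, :3, 3]
    distances = np.linalg.norm(translations - point_3d, axis=1)
    return grasp_poses[np.argmin(distances)]

if __name__ == "__main__":
    # Example with made-up intrinsics, a pointed pixel, and random candidates.
    fx = fy = 600.0
    cx, cy = 320.0, 240.0
    u, v = 412, 188           # pixel the model pointed at (e.g., a teapot handle)
    depth_at_pixel = 0.85     # meters, read from the aligned depth image

    point = backproject_pixel(u, v, depth_at_pixel, fx, fy, cx, cy)

    rng = np.random.default_rng(0)
    candidates = np.tile(np.eye(4), (16, 1, 1))
    candidates[:, :3, 3] = point + rng.normal(scale=0.05, size=(16, 3))

    best = select_nearest_grasp(point, candidates)
    print("Selected grasp translation:", best[:3, 3])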

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-deshpande25a,
  title     = {GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation},
  author    = {Deshpande, Abhay and Deng, Yuquan and Salvador, Jordi and Ray, Arijit and Han, Winson and Duan, Jiafei and Hendrix, Rose and Zhu, Yuke and Krishna, Ranjay},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2983--3007},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/deshpande25a/deshpande25a.pdf},
  url       = {https://proceedings.mlr.press/v305/deshpande25a.html},
  abstract  = {We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from a large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35% achieved by the next best alternative. GraspMolmo also successfully demonstrates the ability to predict semantically correct bimanual grasps zero-shot. We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation.}
}
Endnote
%0 Conference Paper
%T GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
%A Abhay Deshpande
%A Yuquan Deng
%A Jordi Salvador
%A Arijit Ray
%A Winson Han
%A Jiafei Duan
%A Rose Hendrix
%A Yuke Zhu
%A Ranjay Krishna
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-deshpande25a
%I PMLR
%P 2983--3007
%U https://proceedings.mlr.press/v305/deshpande25a.html
%V 305
%X We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from a large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35% achieved by the next best alternative. GraspMolmo also successfully demonstrates the ability to predict semantically correct bimanual grasps zero-shot. We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation.
APA
Deshpande, A., Deng, Y., Salvador, J., Ray, A., Han, W., Duan, J., Hendrix, R., Zhu, Y., & Krishna, R. (2025). GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2983-3007. Available from https://proceedings.mlr.press/v305/deshpande25a.html.