Generalist Robot Manipulation beyond Action Labeled Data

Alexander Spiridonov, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2643-2664, 2025.

Abstract

Recent advances in generalist robot manipulation leverage pre-trained Vision–Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels—featuring humans and/or robots in action—enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations—improving downstream generalist robot policies—but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
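A minimal sketch of the two-stage recipe described above, assuming a PyTorch implementation: a 3D dynamics predictor is first pretrained with self-supervision on hand/gripper point-cloud sequences extracted from unlabeled videos, and its encoder is then reused by an action predictor fine-tuned on a smaller action-labeled dataset. All module names, dimensions, and losses below are illustrative assumptions, not the authors' actual architecture.

    # Hypothetical sketch of the two-stage training described in the abstract.
    # Module names, dimensions, and losses are assumptions, not the paper's method.
    import torch
    import torch.nn as nn

    class PointCloudEncoder(nn.Module):
        """Encodes a (B, N, 3) hand/gripper point cloud into a latent vector."""
        def __init__(self, latent_dim: int = 256):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, latent_dim))

        def forward(self, pts: torch.Tensor) -> torch.Tensor:
            return self.mlp(pts).max(dim=1).values  # permutation-invariant max pooling

    class DynamicsPredictor(nn.Module):
        """Stage 1: self-supervised 3D dynamics, i.e. predict per-point displacement
        of the hand/gripper point cloud at the next timestep (no action labels).
        Assumes per-point correspondence across frames, e.g. from dense tracking."""
        def __init__(self, latent_dim: int = 256):
            super().__init__()
            self.encoder = PointCloudEncoder(latent_dim)
            self.flow_head = nn.Sequential(nn.Linear(latent_dim + 3, 128), nn.ReLU(), nn.Linear(128, 3))

        def forward(self, pts: torch.Tensor) -> torch.Tensor:
            z = self.encoder(pts)                               # (B, D)
            z = z.unsqueeze(1).expand(-1, pts.shape[1], -1)     # broadcast latent to each point
            return self.flow_head(torch.cat([pts, z], dim=-1))  # predicted 3D flow per point

    class ActionPredictor(nn.Module):
        """Stage 2: reuse the pretrained encoder, add a small head mapping the
        latent dynamics representation to robot actions (action alignment)."""
        def __init__(self, pretrained: DynamicsPredictor, action_dim: int = 7):
            super().__init__()
            self.encoder = pretrained.encoder
            self.action_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, action_dim))

        def forward(self, pts: torch.Tensor) -> torch.Tensor:
            return self.action_head(self.encoder(pts))

    # Stage 1: pretrain on unlabeled video-derived point clouds (human or robot).
    dyn = DynamicsPredictor()
    opt = torch.optim.Adam(dyn.parameters(), lr=1e-4)
    pts_t, pts_t1 = torch.randn(8, 512, 3), torch.randn(8, 512, 3)  # placeholder consecutive frames
    loss = nn.functional.mse_loss(pts_t + dyn(pts_t), pts_t1)
    opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: align to actions with a smaller action-labeled dataset.
    policy = ActionPredictor(dyn)
    opt2 = torch.optim.Adam(policy.parameters(), lr=1e-4)
    actions = torch.randn(8, 7)  # placeholder end-effector action labels
    loss2 = nn.functional.mse_loss(policy(pts_t), actions)
    opt2.zero_grad(); loss2.backward(); opt2.step()

The point of the split in this sketch is that the dynamics stage consumes only video-derived point clouds, so it can scale with unlabeled human and robot footage, while the action-alignment stage touches action labels only through a small supervised head.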

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-spiridonov25a,
  title     = {Generalist Robot Manipulation beyond Action Labeled Data},
  author    = {Spiridonov, Alexander and Zaech, Jan-Nico and Nikolov, Nikolay and Gool, Luc Van and Paudel, Danda Pani},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2643--2664},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/spiridonov25a/spiridonov25a.pdf},
  url       = {https://proceedings.mlr.press/v305/spiridonov25a.html},
  abstract  = {Recent advances in generalist robot manipulation leverage pre-trained Vision–Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels—featuring humans and/or robots in action—enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations—improving downstream generalist robot policies—but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.}
}
Endnote
%0 Conference Paper
%T Generalist Robot Manipulation beyond Action Labeled Data
%A Alexander Spiridonov
%A Jan-Nico Zaech
%A Nikolay Nikolov
%A Luc Van Gool
%A Danda Pani Paudel
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-spiridonov25a
%I PMLR
%P 2643--2664
%U https://proceedings.mlr.press/v305/spiridonov25a.html
%V 305
%X Recent advances in generalist robot manipulation leverage pre-trained Vision–Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels—featuring humans and/or robots in action—enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations—improving downstream generalist robot policies—but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
APA
Spiridonov, A., Zaech, J., Nikolov, N., Van Gool, L. & Paudel, D.P. (2025). Generalist Robot Manipulation beyond Action Labeled Data. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2643-2664. Available from https://proceedings.mlr.press/v305/spiridonov25a.html.
