Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Nils Blank, Moritz Reuss, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Wenzel, Oier Mees, Rudolf Lioutikov
Proceedings of The 8th Conference on Robot Learning, PMLR 270:4158-4187, 2025.

Abstract

A central challenge towards developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, hindering their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pre-trained vision-language foundation models in a sophisticated, carefully considered manner in order to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabelled interaction data and ultimately label behavior datasets. Evaluations on BridgeV2 and a kitchen play dataset show that NILS is able to autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets, while alleviating several shortcomings of crowdsourced human annotations.

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-blank25a,
  title = {Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models},
  author = {Blank, Nils and Reuss, Moritz and R{\"{u}}hle, Marcel and Ya\u{g}murlu, {\"{O}}mer Erdin\c{c} and Wenzel, Fabian and Mees, Oier and Lioutikov, Rudolf},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages = {4158--4187},
  year = {2025},
  editor = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume = {270},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Nov},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/blank25a/blank25a.pdf},
  url = {https://proceedings.mlr.press/v270/blank25a.html},
  abstract = {A central challenge towards developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, hindering their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pre-trained vision-language foundation models in a sophisticated, carefully considered manner in order to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabelled interaction data and ultimately label behavior datasets. Evaluations on BridgeV2 and a kitchen play dataset show that NILS is able to autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets, while alleviating several shortcomings of crowdsourced human annotations.}
}
Endnote
%0 Conference Paper
%T Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
%A Nils Blank
%A Moritz Reuss
%A Marcel Rühle
%A Ömer Erdinç Yağmurlu
%A Fabian Wenzel
%A Oier Mees
%A Rudolf Lioutikov
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-blank25a
%I PMLR
%P 4158--4187
%U https://proceedings.mlr.press/v270/blank25a.html
%V 270
%X A central challenge towards developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, hindering their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pre-trained vision-language foundation models in a sophisticated, carefully considered manner in order to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabelled interaction data and ultimately label behavior datasets. Evaluations on BridgeV2 and a kitchen play dataset show that NILS is able to autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets, while alleviating several shortcomings of crowdsourced human annotations.
APA
Blank, N., Reuss, M., Rühle, M., Yağmurlu, Ö.E., Wenzel, F., Mees, O. & Lioutikov, R. (2025). Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:4158-4187. Available from https://proceedings.mlr.press/v270/blank25a.html.
