Autonomous Improvement of Instruction Following Skills via Foundation Models

Zhiyuan Zhou, Pranav Atreya, Abraham Lee, Homer Rich Walke, Oier Mees, Sergey Levine
Proceedings of The 8th Conference on Robot Learning, PMLR 270:4805-4825, 2025.

Abstract

Intelligent robots capable of improving from autonomously collected experience have the potential to transform robot learning: instead of collecting costly teleoperated demonstration data, large-scale deployment of fleets of robots can quickly collect larger quantities of autonomous data useful for training better robot policies. However, autonomous improvement requires solving two key problems: (i) fully automating a scalable data collection procedure that can collect diverse and semantically meaningful robot data and (ii) learning from non-optimal, autonomous data with no human annotations. To this end, we propose a novel approach that addresses these challenges, allowing instruction following policies to improve from autonomously collected data without human supervision. Our framework leverages vision-language models to collect and evaluate semantically meaningful experiences in new environments, and then utilizes a decomposition of instruction following tasks into (semantic) language-conditioned image generation and (non-semantic) goal reaching, which makes it significantly more practical to improve from this autonomously collected data without any human annotations. We carry out extensive experiments in the real world to demonstrate the effectiveness of our approach, and find that in a suite of unseen environments, the robot policy can be improved significantly with autonomously collected data. We open-source the code for our semantic autonomous improvement pipeline, as well as our autonomous dataset of 25K trajectories collected across five tabletop environments: https://soar-autonomous-improvement.github.io
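The decomposition the abstract describes can be sketched, very schematically, as a loop: a VLM proposes a semantically meaningful instruction, a language-conditioned image model renders a subgoal image, a goal-conditioned policy reaches toward it, and the VLM scores the outcome. All names below (`propose_instruction`, `generate_subgoal_image`, `reach_goal`, `evaluate_success`) are hypothetical placeholders standing in for learned models, not the authors' actual API:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for the pipeline's components; in the real system
# these would be a vision-language model, a language-conditioned image
# generator, and a learned goal-conditioned policy.

def propose_instruction(scene: str) -> str:
    """VLM proposes a semantically meaningful task for the current scene."""
    return f"move the {scene.split()[0]} to the left"

def generate_subgoal_image(instruction: str) -> str:
    """Semantic step: language-conditioned image generation -> goal image."""
    return f"goal-image({instruction})"

def reach_goal(goal_image: str) -> List[str]:
    """Non-semantic step: roll out a goal-conditioned policy toward the image."""
    return [f"state-{i}" for i in range(3)] + [goal_image]

def evaluate_success(instruction: str, final_obs: str) -> bool:
    """VLM judges whether the rollout completed the instruction."""
    return instruction in final_obs

@dataclass
class Trajectory:
    instruction: str
    states: List[str]
    success: bool

def autonomous_episode(scene: str) -> Trajectory:
    """One fully automated collect-and-label cycle: no human annotations."""
    instr = propose_instruction(scene)
    goal = generate_subgoal_image(instr)
    states = reach_goal(goal)
    return Trajectory(instr, states, evaluate_success(instr, states[-1]))
```

Repeating `autonomous_episode` across a fleet yields the kind of self-labeled dataset the paper trains on; the key design point is that only the image-generation step is semantic, so the policy itself can improve from goal-reaching data alone.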

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-zhou25b,
  title     = {Autonomous Improvement of Instruction Following Skills via Foundation Models},
  author    = {Zhou, Zhiyuan and Atreya, Pranav and Lee, Abraham and Walke, Homer Rich and Mees, Oier and Levine, Sergey},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {4805--4825},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/zhou25b/zhou25b.pdf},
  url       = {https://proceedings.mlr.press/v270/zhou25b.html},
  abstract  = {Intelligent robots capable of improving from autonomously collected experience have the potential to transform robot learning: instead of collecting costly teleoperated demonstration data, large-scale deployment of fleets of robots can quickly collect larger quantities of autonomous data useful for training better robot policies. However, autonomous improvement requires solving two key problems: (i) fully automating a scalable data collection procedure that can collect diverse and semantically meaningful robot data and (ii) learning from non-optimal, autonomous data with no human annotations. To this end, we propose a novel approach that addresses these challenges, allowing instruction following policies to improve from autonomously collected data without human supervision. Our framework leverages vision-language models to collect and evaluate semantically meaningful experiences in new environments, and then utilizes a decomposition of instruction following tasks into (semantic) language-conditioned image generation and (non-semantic) goal reaching, which makes it significantly more practical to improve from this autonomously collected data without any human annotations. We carry out extensive experiments in the real world to demonstrate the effectiveness of our approach, and find that in a suite of unseen environments, the robot policy can be improved significantly with autonomously collected data. We open-source the code for our semantic autonomous improvement pipeline, as well as our autonomous dataset of 25K trajectories collected across five tabletop environments: https://soar-autonomous-improvement.github.io}
}
Endnote
%0 Conference Paper
%T Autonomous Improvement of Instruction Following Skills via Foundation Models
%A Zhiyuan Zhou
%A Pranav Atreya
%A Abraham Lee
%A Homer Rich Walke
%A Oier Mees
%A Sergey Levine
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-zhou25b
%I PMLR
%P 4805--4825
%U https://proceedings.mlr.press/v270/zhou25b.html
%V 270
%X Intelligent robots capable of improving from autonomously collected experience have the potential to transform robot learning: instead of collecting costly teleoperated demonstration data, large-scale deployment of fleets of robots can quickly collect larger quantities of autonomous data useful for training better robot policies. However, autonomous improvement requires solving two key problems: (i) fully automating a scalable data collection procedure that can collect diverse and semantically meaningful robot data and (ii) learning from non-optimal, autonomous data with no human annotations. To this end, we propose a novel approach that addresses these challenges, allowing instruction following policies to improve from autonomously collected data without human supervision. Our framework leverages vision-language models to collect and evaluate semantically meaningful experiences in new environments, and then utilizes a decomposition of instruction following tasks into (semantic) language-conditioned image generation and (non-semantic) goal reaching, which makes it significantly more practical to improve from this autonomously collected data without any human annotations. We carry out extensive experiments in the real world to demonstrate the effectiveness of our approach, and find that in a suite of unseen environments, the robot policy can be improved significantly with autonomously collected data. We open-source the code for our semantic autonomous improvement pipeline, as well as our autonomous dataset of 25K trajectories collected across five tabletop environments: https://soar-autonomous-improvement.github.io
APA
Zhou, Z., Atreya, P., Lee, A., Walke, H.R., Mees, O. & Levine, S. (2025). Autonomous Improvement of Instruction Following Skills via Foundation Models. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:4805-4825. Available from https://proceedings.mlr.press/v270/zhou25b.html.
