Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2038-2062, 2025.

Abstract

Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and the ability to reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding. In this paper, we introduce a novel test-time computation framework that enhances VLMs’ physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a “reflection” mechanism: it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://corl2025-reflectvlm.github.io.
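
The abstract describes a propose-imagine-reflect loop: the VLM proposes an action, a generative model imagines the resulting future world state, and the VLM critiques and revises its choice. The minimal Python sketch below shows one way such a loop could be organized; all names (vlm.propose_action, world_model.predict, vlm.reflect, vlm.revise_action, max_reflections) are hypothetical placeholders for illustration, not the authors' implementation or API.

def reflective_plan(vlm, world_model, observation, goal, max_reflections=3):
    # Hypothetical sketch of a reflection-style planning step, assuming a VLM
    # policy `vlm` and a generative world model `world_model`; not the paper's code.
    action = vlm.propose_action(observation, goal)                # initial high-level action
    for _ in range(max_reflections):
        imagined = world_model.predict(observation, action)       # imagine the future state
        critique = vlm.reflect(observation, imagined, action, goal)  # critique outcome vs. goal
        if critique.is_satisfactory:                              # accept if no suboptimality found
            break
        action = vlm.revise_action(observation, imagined, critique, goal)  # refine the plan
    return action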

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-feng25b,
  title     = {Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation},
  author    = {Feng, Yunhai and Han, Jiaming and Yang, Zhuoran and Yue, Xiangyu and Levine, Sergey and Luo, Jianlan},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2038--2062},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/feng25b/feng25b.pdf},
  url       = {https://proceedings.mlr.press/v305/feng25b.html},
  abstract  = {Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs’ physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a “reflection” mechanism - it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://corl2025-reflectvlm.github.io.}
}
Endnote
%0 Conference Paper
%T Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
%A Yunhai Feng
%A Jiaming Han
%A Zhuoran Yang
%A Xiangyu Yue
%A Sergey Levine
%A Jianlan Luo
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-feng25b
%I PMLR
%P 2038--2062
%U https://proceedings.mlr.press/v305/feng25b.html
%V 305
%X Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs’ physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a “reflection” mechanism - it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://corl2025-reflectvlm.github.io.
APA
Feng, Y., Han, J., Yang, Z., Yue, X., Levine, S. & Luo, J. (2025). Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2038-2062. Available from https://proceedings.mlr.press/v305/feng25b.html.
