Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models

Seungjae Lee, Daniel Ekpo, Haowen Liu, Furong Huang, Abhinav Shrivastava, Jia-Bin Huang
Proceedings of The 9th Conference on Robot Learning, PMLR 305:4837-4858, 2025.

Abstract

Exploration is key for general-purpose robotic learning, particularly in open-ended environments where explicit guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration often emerges from the drive to discover novel scene configurations and to understand the environment. Inspired by this, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE produces more diverse and meaningful exploration than RL baselines. The collected data facilitates downstream task learning, yielding policies that closely match those trained on human-collected demonstrations.
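
The abstract describes IVE's loop only at a high level: abstract an RGB-D observation into a semantic scene graph, imagine a novel scene, verify its physical plausibility, and execute a skill sequence produced through action tools. As a rough illustration of that structure (this page contains no code), the following Python sketch shows one way such a loop could be organized; every name in it (SceneGraph, ExplorationMemory, env, vlm, and their methods) is a hypothetical placeholder, not the authors' actual interface.

# Hypothetical sketch of the Imagine-Verify-Execute loop described above.
# All classes, methods, and the `env`/`vlm` objects are illustrative
# placeholders; the paper's actual prompts and interfaces are not shown here.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Semantic abstraction of an RGB-D observation: objects and relations."""
    objects: list[str]
    relations: list[tuple[str, str, str]]  # e.g. ("mug", "on", "table")


@dataclass
class ExplorationMemory:
    """Stores visited scene graphs so imagination can be steered toward novelty."""
    visited: list[SceneGraph] = field(default_factory=list)

    def add(self, graph: SceneGraph) -> None:
        self.visited.append(graph)


def explore(env, vlm, steps: int = 10) -> ExplorationMemory:
    """One possible memory-guided exploration loop.

    Assumes `env` exposes observe()/execute(skills) and `vlm` exposes the roles
    named in the abstract (scene-graph abstraction, imagination of novel
    scenes, plausibility verification) plus skill planning via action tools.
    """
    memory = ExplorationMemory()
    for _ in range(steps):
        graph = vlm.abstract_to_scene_graph(env.observe())
        memory.add(graph)

        # Imagine: propose a novel scene configuration, conditioned on memory.
        imagined = vlm.imagine_novel_scene(graph, memory.visited)

        # Verify: skip imagined transitions that are not physically plausible.
        if not vlm.verify_plausibility(graph, imagined):
            continue

        # Execute: turn the verified transition into a skill sequence and run it.
        skills = vlm.plan_skill_sequence(graph, imagined)
        env.execute(skills)
    return memory

The structural point worth noting from the abstract is the verification gate between imagination and execution: imagined transitions judged physically implausible are discarded before any skill sequence is sent to the robot or simulator.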

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-lee25b,
  title     = {Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models},
  author    = {Lee, Seungjae and Ekpo, Daniel and Liu, Haowen and Huang, Furong and Shrivastava, Abhinav and Huang, Jia-Bin},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {4837--4858},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/lee25b/lee25b.pdf},
  url       = {https://proceedings.mlr.press/v305/lee25b.html},
  abstract  = {Exploration is key for general-purpose robotic learning, particularly in open-ended environments where explicit guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration often emerges from the drive to discover novel scene configurations and to understand the environment. Inspired by this, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE produces more diverse and meaningful exploration than RL baselines. The collected data facilitates learning downstream tasks that closely match those of policies trained on human-collected demonstrations.}
}
Endnote
%0 Conference Paper
%T Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models
%A Seungjae Lee
%A Daniel Ekpo
%A Haowen Liu
%A Furong Huang
%A Abhinav Shrivastava
%A Jia-Bin Huang
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-lee25b
%I PMLR
%P 4837--4858
%U https://proceedings.mlr.press/v305/lee25b.html
%V 305
%X Exploration is key for general-purpose robotic learning, particularly in open-ended environments where explicit guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration often emerges from the drive to discover novel scene configurations and to understand the environment. Inspired by this, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE produces more diverse and meaningful exploration than RL baselines. The collected data facilitates learning downstream tasks that closely match those of policies trained on human-collected demonstrations.
APA
Lee, S., Ekpo, D., Liu, H., Huang, F., Shrivastava, A. & Huang, J.-B. (2025). Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:4837-4858. Available from https://proceedings.mlr.press/v305/lee25b.html.