WoMAP: World Models For Embodied Open-Vocabulary Object Localization

Tenny Yin, Zhiting Mei, Tao Sun, Ola Sho, Anirudha Majumdar, Emily Zhou, Jeremy Bao, Miyu Yamane, Lihan Zha
Proceedings of The 9th Conference on Robot Learning, PMLR 305:3605-3630, 2025.

Abstract

Active object localization remains a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art robot policies either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a wide range of zero-shot object localization tasks, with a 63% success rate compared to a 10% success rate for a VLM baseline, and only a 10-20% drop in performance when directly transferring from sim to real.

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-yin25b,
  title = {WoMAP: World Models For Embodied Open-Vocabulary Object Localization},
  author = {Yin, Tenny and Mei, Zhiting and Sun, Tao and Sho, Ola and Majumdar, Anirudha and Zhou, Emily and Bao, Jeremy and Yamane, Miyu and Zha, Lihan},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages = {3605--3630},
  year = {2025},
  editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume = {305},
  series = {Proceedings of Machine Learning Research},
  month = {27--30 Sep},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/yin25b/yin25b.pdf},
  url = {https://proceedings.mlr.press/v305/yin25b.html},
  abstract = {Active object localization remains a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art robot policies either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a wide range of zero-shot object localization tasks, with a 63% success rate compared to a 10% success rate for a VLM baseline, and only a 10-20% drop in performance when directly transferring from sim to real.}
}
Endnote
%0 Conference Paper
%T WoMAP: World Models For Embodied Open-Vocabulary Object Localization
%A Tenny Yin
%A Zhiting Mei
%A Tao Sun
%A Ola Sho
%A Anirudha Majumdar
%A Emily Zhou
%A Jeremy Bao
%A Miyu Yamane
%A Lihan Zha
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-yin25b
%I PMLR
%P 3605--3630
%U https://proceedings.mlr.press/v305/yin25b.html
%V 305
%X Active object localization remains a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art robot policies either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a wide range of zero-shot object localization tasks, with a 63% success rate compared to a 10% success rate for a VLM baseline, and only a 10-20% drop in performance when directly transferring from sim to real.
APA
Yin, T., Mei, Z., Sun, T., Sho, O., Majumdar, A., Zhou, E., Bao, J., Yamane, M. & Zha, L. (2025). WoMAP: World Models For Embodied Open-Vocabulary Object Localization. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3605-3630. Available from https://proceedings.mlr.press/v305/yin25b.html.