FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making

Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:65291-65309, 2025.

Abstract

Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent’s physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
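
To make the pipeline described above more concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the two ingredients the abstract names: a mapping network that grounds a foundation-model embedding of a task specification (text or video) in the world model's latent state space, and a reward defined as the negative predicted temporal distance from a latent state to that mapped goal state. The module names, layer sizes, and distance parameterization are illustrative assumptions.

    # Minimal sketch of the grounding map and temporal-distance reward.
    # All names, dimensions, and architectures here are assumptions for
    # illustration, not the paper's actual implementation.
    import torch
    import torch.nn as nn

    FM_DIM, WM_DIM = 512, 230   # hypothetical FM embedding / WM latent sizes


    class GroundingMap(nn.Module):
        """Maps an FM representation of the task (text/video) to a WM goal state."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(FM_DIM, 512), nn.ELU(),
                nn.Linear(512, WM_DIM),
            )

        def forward(self, fm_embedding: torch.Tensor) -> torch.Tensor:
            return self.net(fm_embedding)   # goal state in the WM latent space


    class TemporalDistance(nn.Module):
        """Predicts how many steps separate a latent state from the goal state."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * WM_DIM, 512), nn.ELU(),
                nn.Linear(512, 1), nn.Softplus(),   # non-negative step count
            )

        def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)


    def goal_reward(state, goal, distance_model):
        # Fewer predicted steps to the goal => higher reward.
        return -distance_model(state, goal)


    # Usage: ground a task embedding, then score an imagined latent state.
    grounder, dist = GroundingMap(), TemporalDistance()
    task_embedding = torch.randn(1, FM_DIM)   # e.g., output of a frozen FM
    goal = grounder(task_embedding)
    imagined_state = torch.randn(1, WM_DIM)   # produced by world-model rollouts
    reward = goal_reward(imagined_state, goal, dist)

In the framework the abstract describes, such a reward would then drive goal-conditioned behavior learning in imagination, i.e., on latent rollouts generated by the world model, without access to environment rewards.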

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wang25ee,
  title     = {{FOUNDER}: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making},
  author    = {Wang, Yucen and Yu, Rui and Wan, Shenghua and Gan, Le and Zhan, De-Chuan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {65291--65309},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25ee/wang25ee.pdf},
  url       = {https://proceedings.mlr.press/v267/wang25ee.html},
  abstract  = {Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent’s physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.}
}
Endnote
%0 Conference Paper
%T FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
%A Yucen Wang
%A Rui Yu
%A Shenghua Wan
%A Le Gan
%A De-Chuan Zhan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wang25ee
%I PMLR
%P 65291--65309
%U https://proceedings.mlr.press/v267/wang25ee.html
%V 267
%X Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent’s physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
APA
Wang, Y., Yu, R., Wan, S., Gan, L. & Zhan, D. (2025). FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:65291-65309. Available from https://proceedings.mlr.press/v267/wang25ee.html.