DreamGen: Unlocking Generalization in Robot Learning through Video World Models

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loïc Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zhu, Linxi Fan
Proceedings of The 9th Conference on Robot Learning, PMLR 305:5170-5194, 2025.

Abstract

In this work, we unlock new capabilities in robot learning from neural trajectories: synthetic robot data generated by video world models. Our proposed recipe is simple but powerful: we take recent state-of-the-art video generative models (world models), adapt them to the target robot embodiment, and generate new synthetic robot data for the same task or even for new behaviors. Since these video world models generate only videos, we explore two techniques for obtaining robot actions: extracting latent actions with a general-purpose latent action model, and predicting actions with an inverse-dynamics model (IDM), giving flexibility across diverse scenarios. Our approach unlocks behavior and environment generalization, allowing a humanoid robot to perform more than 20 new behaviors in unseen environments while teleoperation data is collected only for pick-and-place in a single environment. By introducing a new world-modeling benchmark, we demonstrate that stronger video world models correlate directly with improved downstream robot policy performance. This establishes a new scaling dimension beyond simply collecting additional teleoperation data, changing how we approach robot learning.
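
To make the recipe concrete, below is a minimal sketch of the pipeline the abstract describes: adapt a video world model to the robot, generate synthetic videos for new behaviors or environments, and label the frames with actions from an IDM to obtain neural trajectories for policy training. All class and function names here (VideoWorldModel, InverseDynamicsModel, build_neural_trajectories) are hypothetical stand-ins for illustration, not the authors' released code, and the stub models return random arrays in place of real model outputs.

import numpy as np

class VideoWorldModel:
    """Stand-in for a video generative model adapted to the target robot embodiment."""
    def finetune(self, teleop_videos):
        pass  # step 1: adapt the pretrained world model on the robot's teleoperation videos

    def generate(self, first_frame, instruction, horizon=16):
        # step 2: generate a synthetic rollout conditioned on an initial frame and a language prompt
        return np.random.rand(horizon, 64, 64, 3)  # (T, H, W, C) placeholder video

class InverseDynamicsModel:
    """Stand-in IDM: predicts the action taken between two consecutive frames."""
    def predict(self, frame_t, frame_t_plus_1):
        return np.random.rand(7)  # e.g. a 7-DoF action vector (placeholder)

def build_neural_trajectories(world_model, idm, prompts):
    """Step 3: generate videos for new behaviors, then label them with pseudo-actions."""
    trajectories = []
    for first_frame, instruction in prompts:
        video = world_model.generate(first_frame, instruction)
        actions = [idm.predict(video[t], video[t + 1]) for t in range(len(video) - 1)]
        trajectories.append({"frames": video,
                             "actions": np.stack(actions),
                             "instruction": instruction})
    return trajectories

if __name__ == "__main__":
    wm, idm = VideoWorldModel(), InverseDynamicsModel()
    wm.finetune(teleop_videos=[])                       # adapt to the target embodiment
    prompts = [(np.random.rand(64, 64, 3), "fold the towel")]
    neural_trajs = build_neural_trajectories(wm, idm, prompts)
    # step 4 (not shown): train the downstream visuomotor policy on neural_trajs
    print(len(neural_trajs), neural_trajs[0]["actions"].shape)

In the paper's setting, the latent action model plays the same labeling role as the IDM stub above when ground-truth action supervision for the generated videos is unavailable.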

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-jang25a,
  title     = {DreamGen: Unlocking Generalization in Robot Learning through Video World Models},
  author    = {Jang, Joel and Ye, Seonghyeon and Lin, Zongyu and Xiang, Jiannan and Bjorck, Johan and Fang, Yu and Hu, Fengyuan and Huang, Spencer and Kundalia, Kaushil and Lin, Yen-Chen and Magne, Lo\"{i}c and Mandlekar, Ajay and Narayan, Avnish and Tan, You Liang and Wang, Guanzhi and Wang, Jing and Wang, Qi and Xu, Yinzhen and Zeng, Xiaohui and Zheng, Kaiyuan and Zheng, Ruijie and Liu, Ming-Yu and Zettlemoyer, Luke and Fox, Dieter and Kautz, Jan and Reed, Scott and Zhu, Yuke and Fan, Linxi},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {5170--5194},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/jang25a/jang25a.pdf},
  url       = {https://proceedings.mlr.press/v305/jang25a.html}
}
APA
Jang, J., Ye, S., Lin, Z., Xiang, J., Bjorck, J., Fang, Y., Hu, F., Huang, S., Kundalia, K., Lin, Y., Magne, L., Mandlekar, A., Narayan, A., Tan, Y.L., Wang, G., Wang, J., Wang, Q., Xu, Y., Zeng, X., Zheng, K., Zheng, R., Liu, M., Zettlemoyer, L., Fox, D., Kautz, J., Reed, S., Zhu, Y. & Fan, L. (2025). DreamGen: Unlocking Generalization in Robot Learning through Video World Models. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:5170-5194. Available from https://proceedings.mlr.press/v305/jang25a.html.
