Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
Proceedings of The 9th Conference on Robot Learning, PMLR 305:3936-3951, 2025.

Abstract

How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how to leverage video generation models trained on easily available web data to enable generalization. Our approach, Gen2Act, casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data than what the video prediction model was trained on. Gen2Act does not require fine-tuning the video model at all; we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
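
To make the two-stage pipeline described in the abstract concrete, the sketch below shows the control flow at inference time: a pre-trained video generation model produces a human video from the task instruction and the initial scene image, and a video-conditioned policy then executes actions in a closed loop. This is a minimal illustrative sketch only; all class and function names (HumanVideoGenerator, VideoConditionedPolicy, robot.step, etc.) are hypothetical placeholders, not the paper's actual API.

# Minimal inference-time sketch of the Gen2Act pipeline, assuming hypothetical
# wrappers around a pre-trained video model and a learned video-conditioned policy.

def run_gen2act(task_instruction, robot, max_steps=200):
    # Stage 1: zero-shot human video generation with a web-pretrained model.
    # The video model is used as-is (no fine-tuning), conditioned on the
    # language instruction and the initial scene image.
    initial_image = robot.get_camera_image()
    video_model = HumanVideoGenerator.load_pretrained()      # hypothetical name
    human_video = video_model.generate(initial_image, task_instruction)

    # Stage 2: closed-loop execution with a single policy conditioned on the
    # generated video, trained on comparatively little robot interaction data.
    policy = VideoConditionedPolicy.load("gen2act_policy")   # hypothetical name
    for _ in range(max_steps):
        observation = robot.get_camera_image()
        action = policy.act(observation, human_video)
        robot.step(action)
        if robot.task_done():
            break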

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-bharadhwaj25a,
  title     = {Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation},
  author    = {Bharadhwaj, Homanga and Dwibedi, Debidatta and Gupta, Abhinav and Tulsiani, Shubham and Doersch, Carl and Xiao, Ted and Shah, Dhruv and Xia, Fei and Sadigh, Dorsa and Kirmani, Sean},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {3936--3951},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/bharadhwaj25a/bharadhwaj25a.pdf},
  url       = {https://proceedings.mlr.press/v305/bharadhwaj25a.html},
  abstract  = {How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. \textit{Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video.} To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.}
}
Endnote
%0 Conference Paper
%T Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
%A Homanga Bharadhwaj
%A Debidatta Dwibedi
%A Abhinav Gupta
%A Shubham Tulsiani
%A Carl Doersch
%A Ted Xiao
%A Dhruv Shah
%A Fei Xia
%A Dorsa Sadigh
%A Sean Kirmani
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-bharadhwaj25a
%I PMLR
%P 3936--3951
%U https://proceedings.mlr.press/v305/bharadhwaj25a.html
%V 305
%X How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
APA
Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D. & Kirmani, S. (2025). Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3936-3951. Available from https://proceedings.mlr.press/v305/bharadhwaj25a.html.
