Generative Image as Action Models

Mohit Shridhar, Yat Long Lo, Stephen James
Proceedings of The 8th Conference on Robot Learning, PMLR 270:2429-2455, 2025.

Abstract

Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to “draw joint-actions” as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
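The abstract describes a two-stage pipeline: a fine-tuned diffusion model draws joint-action targets onto the RGB observation, and a separate controller decodes those drawn targets into a sequence of joint positions. The following is a minimal Python sketch of that control loop under stated assumptions; every name here (draw_targets, TargetController, the 7-joint arm, the 20-step horizon) is an illustrative stand-in, not the authors' released code or API.

# Hypothetical sketch of the two-stage GENIMA control loop from the abstract.
# Stand-in functions replace the fine-tuned Stable Diffusion model and the
# learned controller; shapes and names are assumptions for illustration only.

import numpy as np

def draw_targets(rgb: np.ndarray, instruction: str) -> np.ndarray:
    """Stage 1 stand-in: the real system would run a fine-tuned Stable
    Diffusion model that takes the current RGB observation plus a task
    instruction and returns the image with joint-action targets drawn on it.
    Here we simply return a copy of the input frame."""
    return rgb.copy()

class TargetController:
    """Stage 2 stand-in: maps an image with drawn targets to a short
    sequence of joint positions for the robot to execute."""

    def __init__(self, n_joints: int = 7, horizon: int = 20):
        self.n_joints = n_joints
        self.horizon = horizon

    def act(self, target_image: np.ndarray) -> np.ndarray:
        # A trained policy would decode the drawn targets; this placeholder
        # emits a zero trajectory of shape (horizon, n_joints).
        return np.zeros((self.horizon, self.n_joints))

# One control step: observe -> draw targets on the image -> decode joints.
rgb = np.zeros((256, 256, 3), dtype=np.uint8)       # placeholder camera frame
targets = draw_targets(rgb, "put the block in the drawer")
controller = TargetController()
joint_sequence = controller.act(targets)            # (20, 7) joint positions
print(joint_sequence.shape)

Note the design point the abstract emphasizes: because actions are lifted into image-space in stage 1, the diffusion model needs no depth, keypoint, or motion-planner priors; only the stage-2 controller touches joint-space.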

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-shridhar25a,
  title = {Generative Image as Action Models},
  author = {Shridhar, Mohit and Lo, Yat Long and James, Stephen},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages = {2429--2455},
  year = {2025},
  editor = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume = {270},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Nov},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/shridhar25a/shridhar25a.pdf},
  url = {https://proceedings.mlr.press/v270/shridhar25a.html},
  abstract = {Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to “draw joint-actions” as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.}
}
Endnote
%0 Conference Paper
%T Generative Image as Action Models
%A Mohit Shridhar
%A Yat Long Lo
%A Stephen James
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-shridhar25a
%I PMLR
%P 2429--2455
%U https://proceedings.mlr.press/v270/shridhar25a.html
%V 270
%X Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to “draw joint-actions” as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
APA
Shridhar, M., Lo, Y. L., & James, S. (2025). Generative Image as Action Models. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:2429-2455. Available from https://proceedings.mlr.press/v270/shridhar25a.html.
