Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, Huazhe Xu
Proceedings of The 8th Conference on Robot Learning, PMLR 270:1815-1833, 2025.

Abstract

Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with a Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen visual generalization. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulated-object, bi-manual, and dexterous-hand manipulation tasks, demonstrating Maniwhere’s strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://maniwhere.github.io.
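To make the two ingredients named in the abstract concrete, the following is a minimal sketch (not the authors' code) of what an STN block and a multi-view representation alignment objective could look like in PyTorch. All module names, network sizes, and the cosine-similarity loss are illustrative assumptions; the paper itself should be consulted for the actual architecture and objective.

```python
# Hypothetical sketch: an STN block that predicts an affine warp of the input,
# plus a simple alignment loss that pulls together encoder features of the
# same state rendered from two different camera viewpoints.
import torch
import torch.nn as nn
import torch.nn.functional as F


class STNBlock(nn.Module):
    """Predicts a 2x3 affine transform and warps the input image/feature map."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # Initialize the final layer to the identity transform for stable training.
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = self.localization(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)


def multiview_alignment_loss(encoder: nn.Module,
                             obs_view_a: torch.Tensor,
                             obs_view_b: torch.Tensor) -> torch.Tensor:
    """Encourage matching features for two camera views of the same state
    (cosine-similarity objective; an assumption, not the paper's exact loss)."""
    z_a = F.normalize(encoder(obs_view_a).flatten(1), dim=-1)
    z_b = F.normalize(encoder(obs_view_b).flatten(1), dim=-1)
    return (1.0 - (z_a * z_b).sum(dim=-1)).mean()


if __name__ == "__main__":
    encoder = nn.Sequential(
        STNBlock(3),
        nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4),
    )
    view_a = torch.rand(8, 3, 84, 84)   # e.g. front camera
    view_b = torch.rand(8, 3, 84, 84)   # e.g. side camera, same underlying state
    loss = multiview_alignment_loss(encoder, view_a, view_b)
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

In a full training loop, such an alignment term would typically be added to the RL loss, with the curriculum-based randomization gradually increasing the strength of visual augmentations applied to each view.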

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-yuan25b,
  title     = {Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning},
  author    = {Yuan, Zhecheng and Wei, Tianming and Cheng, Shuiqi and Zhang, Gu and Chen, Yuanpei and Xu, Huazhe},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {1815--1833},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/yuan25b/yuan25b.pdf},
  url       = {https://proceedings.mlr.press/v270/yuan25b.html},
  abstract  = {Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulate objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere’s strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://maniwhere.github.io.}
}
Endnote
%0 Conference Paper
%T Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
%A Zhecheng Yuan
%A Tianming Wei
%A Shuiqi Cheng
%A Gu Zhang
%A Yuanpei Chen
%A Huazhe Xu
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-yuan25b
%I PMLR
%P 1815--1833
%U https://proceedings.mlr.press/v270/yuan25b.html
%V 270
%X Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulate objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere’s strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://maniwhere.github.io.
APA
Yuan, Z., Wei, T., Cheng, S., Zhang, G., Chen, Y., & Xu, H. (2025). Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:1815-1833. Available from https://proceedings.mlr.press/v270/yuan25b.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v270/main/assets/yuan25b/yuan25b.pdf