Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation

Chuye Zhang, Xiaoxiong Zhang, Linfang Zheng, Wei Pan, Wei Zhang
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2823-2846, 2025.

Abstract

Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
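
The abstract describes a generate-then-estimate control loop: predict future RGB-D frames from an image and a task description, extract an end-effector pose from the prediction, execute it, and repeat. The sketch below is a minimal, illustrative rendering of that loop under assumptions about the interfaces; all class and method names (video_model.predict, pose_estimator.estimate, controller.move_to, and so on) are hypothetical placeholders, not the authors' implementation or API.

# Minimal illustrative sketch of a GVF-TAPE-style closed loop (hypothetical interfaces).
class ForesightPoseLoop:
    def __init__(self, video_model, pose_estimator, controller, camera):
        self.video_model = video_model        # generative video model (RGB-D foresight)
        self.pose_estimator = pose_estimator  # task-agnostic end-effector pose estimator
        self.controller = controller          # low-level robot controller
        self.camera = camera                  # side-view RGB camera

    def run(self, task_description, max_steps=50):
        for _ in range(max_steps):
            # 1. Observe: a single RGB side-view image of the current scene.
            rgb = self.camera.capture_rgb()

            # 2. Foresee: predict future RGB-D frames conditioned on the image
            #    and the natural-language task description.
            predicted_frames = self.video_model.predict(rgb, task_description)

            # 3. Estimate: extract the end-effector pose from the next predicted frame.
            target_pose = self.pose_estimator.estimate(predicted_frames[0])

            # 4. Act: translate the pose into an executable command via the controller.
            self.controller.move_to(target_pose)

            # 5. Close the loop: stop once the controller reports task completion.
            if self.controller.task_done():
                break

Because the pose estimator is decoupled from any particular task, only the video model consumes the task description; this is what lets the loop re-plan from fresh observations at each step.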

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-zhang25g,
  title = {Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation},
  author = {Zhang, Chuye and Zhang, Xiaoxiong and Zheng, Linfang and Pan, Wei and Zhang, Wei},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages = {2823--2846},
  year = {2025},
  editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume = {305},
  series = {Proceedings of Machine Learning Research},
  month = {27--30 Sep},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/zhang25g/zhang25g.pdf},
  url = {https://proceedings.mlr.press/v305/zhang25g.html},
  abstract = {Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.}
}
Endnote
%0 Conference Paper
%T Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation
%A Chuye Zhang
%A Xiaoxiong Zhang
%A Linfang Zheng
%A Wei Pan
%A Wei Zhang
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-zhang25g
%I PMLR
%P 2823--2846
%U https://proceedings.mlr.press/v305/zhang25g.html
%V 305
%X Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
APA
Zhang, C., Zhang, X., Zheng, L., Pan, W. & Zhang, W. (2025). Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2823-2846. Available from https://proceedings.mlr.press/v305/zhang25g.html.
