Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation

Chuye Zhang, Xiaoxiong Zhang, Linfang Zheng, Wei Pan, Wei Zhang
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2823-2846, 2025.

Abstract

Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
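
The abstract describes a generate-then-estimate control loop: predict future RGB-D frames from an image and a task description, extract an end-effector pose from the prediction, execute it, and repeat. The sketch below is a minimal, illustrative rendering of that loop under assumptions about the interfaces; all class and method names (video_model.predict, pose_estimator.estimate, controller.move_to, and so on) are hypothetical placeholders, not the authors' implementation or API.

# Minimal illustrative sketch of a GVF-TAPE-style closed loop (hypothetical interfaces).
class ForesightPoseLoop:
    def __init__(self, video_model, pose_estimator, controller, camera):
        self.video_model = video_model        # generative video model (RGB-D foresight)
        self.pose_estimator = pose_estimator  # task-agnostic end-effector pose estimator
        self.controller = controller          # low-level robot controller
        self.camera = camera                  # side-view RGB camera

    def run(self, task_description, max_steps=50):
        for _ in range(max_steps):
            # 1. Observe: a single RGB side-view image of the current scene.
            rgb = self.camera.capture_rgb()

            # 2. Foresee: predict future RGB-D frames conditioned on the image
            #    and the natural-language task description.
            predicted_frames = self.video_model.predict(rgb, task_description)

            # 3. Estimate: extract the end-effector pose from the next predicted frame.
            target_pose = self.pose_estimator.estimate(predicted_frames[0])

            # 4. Act: translate the pose into an executable command via the controller.
            self.controller.move_to(target_pose)

            # 5. Close the loop: stop once the controller reports task completion.
            if self.controller.task_done():
                break

Because the pose estimator is decoupled from any particular task, only the video model consumes the task description; this is what lets the loop re-plan from fresh observations at each step.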

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-zhang25g,
  title = {Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation},
  author = {Zhang, Chuye and Zhang, Xiaoxiong and Zheng, Linfang and Pan, Wei and Zhang, Wei},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages = {2823--2846},
  year = {2025},
  editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume = {305},
  series = {Proceedings of Machine Learning Research},
  month = {27--30 Sep},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/zhang25g/zhang25g.pdf},
  url = {https://proceedings.mlr.press/v305/zhang25g.html},
  abstract = {Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.}
}
Endnote
%0 Conference Paper
%T Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation
%A Chuye Zhang
%A Xiaoxiong Zhang
%A Linfang Zheng
%A Wei Pan
%A Wei Zhang
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-zhang25g
%I PMLR
%P 2823--2846
%U https://proceedings.mlr.press/v305/zhang25g.html
%V 305
%X Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
APA
Zhang, C., Zhang, X., Zheng, L., Pan, W. & Zhang, W. (2025). Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2823-2846. Available from https://proceedings.mlr.press/v305/zhang25g.html.
