Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Justin Kerr, Chung Min Kim, Mingxuan Wu, Brent Yi, Qianqian Wang, Ken Goldberg, Angjoo Kanazawa
Proceedings of The 8th Conference on Robot Learning, PMLR 270:587-603, 2025.

Abstract

Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration, given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization, which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration’s intended behavior while considering the robot’s own morphological limits, rather than attempting to reproduce the hand’s motion. We evaluate 4D-DPM’s 3D tracking accuracy on ground-truth-annotated 3D part trajectories and RSRD’s physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average success rate of 87%, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: https://robot-see-robot-do.github.io
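
The abstract describes 4D-DPM as an analysis-by-synthesis optimization: per-part poses are iteratively adjusted so that a differentiable render of part-centric feature fields matches the features observed in each video frame, under geometric regularization. The sketch below illustrates that general loop only; it is not the authors' implementation. `render_parts`, `extract_features`, and `part_model` are hypothetical placeholders for a differentiable feature renderer, a pretrained feature extractor (e.g., DINO), and the scanned part model, and the simple temporal-smoothness term stands in for the paper's geometric regularizers.

# Minimal illustrative sketch (assumed interfaces, not the authors' code) of an
# analysis-by-synthesis part-tracking loop in the spirit of 4D-DPM.
import torch

def track_parts(video_frames, part_model, num_parts, iters=50, lr=1e-2):
    """Recover a per-frame 6-DoF pose (axis-angle + translation) for each part."""
    # One 6-DoF pose vector per part, optimized per frame and warm-started from
    # the previous frame because frames are processed sequentially.
    pose = torch.zeros(num_parts, 6, requires_grad=True)
    trajectory = []
    for frame in video_frames:
        target_feat = extract_features(frame).detach()  # (H, W, C) observed feature map
        opt = torch.optim.Adam([pose], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            rendered_feat = render_parts(part_model, pose)  # differentiable render, (H, W, C)
            feat_loss = (rendered_feat - target_feat).square().mean()
            # Geometric regularizer stand-in: penalize large frame-to-frame part motion.
            smooth_loss = (pose - trajectory[-1]).square().mean() if trajectory else 0.0
            loss = feat_loss + 0.1 * smooth_loss
            loss.backward()
            opt.step()
        trajectory.append(pose.detach().clone())
    return torch.stack(trajectory)  # (num_frames, num_parts, 6)

In the paper, the recovered part-centric trajectory then serves as the target for bimanual motion planning, rather than the demonstrator's hand motion being copied directly.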

Cite this Paper

BibTeX
@InProceedings{pmlr-v270-kerr25a,
  title     = {Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction},
  author    = {Kerr, Justin and Kim, Chung Min and Wu, Mingxuan and Yi, Brent and Wang, Qianqian and Goldberg, Ken and Kanazawa, Angjoo},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {587--603},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/kerr25a/kerr25a.pdf},
  url       = {https://proceedings.mlr.press/v270/kerr25a.html}
}
Endnote
%0 Conference Paper
%T Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction
%A Justin Kerr
%A Chung Min Kim
%A Mingxuan Wu
%A Brent Yi
%A Qianqian Wang
%A Ken Goldberg
%A Angjoo Kanazawa
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-kerr25a
%I PMLR
%P 587--603
%U https://proceedings.mlr.press/v270/kerr25a.html
%V 270
APA
Kerr, J., Kim, C.M., Wu, M., Yi, B., Wang, Q., Goldberg, K. & Kanazawa, A. (2025). Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:587-603. Available from https://proceedings.mlr.press/v270/kerr25a.html.
