X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

Prithwish Dan, Kushal Kedia, Angela Chao, Edward Duan, Maximus Adrian Pace, Wei-Chiu Ma, Sanjiban Choudhury
Proceedings of The 9th Conference on Robot Learning, PMLR 305:816-833, 2025.

Abstract

Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection, and (3) generalizes to new camera viewpoints and test-time changes.
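The abstract's key idea is to use tracked object motion, rather than human hand motion, as the training signal: object trajectories extracted from the human video define dense, object-centric rewards for RL in simulation. As an illustrative aid only, the sketch below shows one plausible form such a reward could take, a per-step term measuring how closely the simulated object's pose tracks the pose recovered from the human video at the same timestep. The function and parameter names (object_tracking_reward, pos_scale, rot_scale) are assumptions for illustration, not the paper's implementation.

import numpy as np

def object_tracking_reward(sim_pos, sim_quat, demo_pos, demo_quat,
                           pos_scale=10.0, rot_scale=2.0):
    # Hypothetical dense reward: higher when the simulated object's pose at
    # timestep t is close to the object pose tracked from the human video.
    # (Weighting and exact form are assumptions, not the paper's formulation.)
    pos_err = np.linalg.norm(np.asarray(sim_pos) - np.asarray(demo_pos))
    # Quaternion distance via |dot|, invariant to the q / -q ambiguity.
    rot_err = 1.0 - abs(float(np.dot(sim_quat, demo_quat)))
    return np.exp(-pos_scale * pos_err) + np.exp(-rot_scale * rot_err)

# Example: reward when the simulated object is 2 cm and a small rotation away
# from the demonstrated pose at the current timestep.
r = object_tracking_reward([0.50, 0.10, 0.05], [1.0, 0.0, 0.0, 0.0],
                           [0.52, 0.10, 0.05], [0.999, 0.035, 0.0, 0.0])

Because a reward of this kind depends only on object state, the same signal applies regardless of whether a human hand or a robot arm moves the object, which is the cross-embodiment transferability the abstract emphasizes.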

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-dan25a,
  title     = {X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real},
  author    = {Dan, Prithwish and Kedia, Kushal and Chao, Angela and Duan, Edward and Pace, Maximus Adrian and Ma, Wei-Chiu and Choudhury, Sanjiban},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {816--833},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/dan25a/dan25a.pdf},
  url       = {https://proceedings.mlr.press/v305/dan25a.html},
  abstract  = {Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection, and (3) generalizes to new camera viewpoints and test-time changes.}
}
Endnote
%0 Conference Paper
%T X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real
%A Prithwish Dan
%A Kushal Kedia
%A Angela Chao
%A Edward Duan
%A Maximus Adrian Pace
%A Wei-Chiu Ma
%A Sanjiban Choudhury
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-dan25a
%I PMLR
%P 816--833
%U https://proceedings.mlr.press/v305/dan25a.html
%V 305
%X Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection, and (3) generalizes to new camera viewpoints and test-time changes.
APA
Dan, P., Kedia, K., Chao, A., Duan, E., Pace, M. A., Ma, W.-C., & Choudhury, S. (2025). X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:816-833. Available from https://proceedings.mlr.press/v305/dan25a.html.