Imitation Learning from a Single Temporally Misaligned Video

William Huey, Huaxiaoyue Wang, Anne Wu, Yoav Artzi, Sanjiban Choudhury
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:26283-26310, 2025.

Abstract

We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. The project website is at https://portal-cornell.github.io/orca/
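To make the ordered-coverage idea concrete, below is a minimal sketch of a reward in this spirit, not the authors' implementation: it assumes per-frame matching probabilities computed from visual embeddings, and the helper names (`matching_probs`, `ordered_coverage_reward`) are illustrative. It runs a simple dynamic program over the probability that, by agent timestep t, the first k demonstration frames have been covered in order, and emits a dense per-timestep reward from that coverage.

```python
import numpy as np

def matching_probs(agent_emb, demo_emb, temp=0.1):
    """Per-timestep probabilities that each agent frame matches each demo frame.

    agent_emb: (T, d) agent frame embeddings (assumed unit-normalized).
    demo_emb:  (K, d) demonstration frame embeddings (assumed unit-normalized).
    Returns a (T, K) array; row t is a softmax over demo frames.
    """
    sim = (agent_emb @ demo_emb.T) / temp
    sim -= sim.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(sim)
    return p / p.sum(axis=1, keepdims=True)

def ordered_coverage_reward(p):
    """Dense reward from an ordered-coverage dynamic program (illustrative).

    cover[t, k] = probability that demo frames 0..k have all been matched,
    in order, by agent timestep t. At each step the agent either newly
    covers frame k (having covered 0..k-1 earlier) or keeps its coverage.
    """
    T, K = p.shape
    cover = np.zeros((T, K))
    cover[0, 0] = p[0, 0]
    for t in range(1, T):
        cover[t, 0] = max(cover[t - 1, 0], p[t, 0])
        for k in range(1, K):
            cover[t, k] = max(cover[t - 1, k],                 # already covered
                              cover[t - 1, k - 1] * p[t, k])   # cover frame k now
    # Reward at timestep t: fraction of ordered progress through the demo.
    return cover.sum(axis=1) / K

# Usage with random embeddings standing in for visual features:
rng = np.random.default_rng(0)
agent = rng.normal(size=(50, 32)); agent /= np.linalg.norm(agent, axis=1, keepdims=True)
demo = rng.normal(size=(10, 32));  demo /= np.linalg.norm(demo, axis=1, keepdims=True)
rewards = ordered_coverage_reward(matching_probs(agent, demo))
print(rewards.shape)  # (50,)
```

Unlike frame-level matching, this reward only increases when the next demonstration frame is matched after its predecessors, so skipping subgoals or matching them out of order yields no credit.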

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-huey25a,
  title     = {Imitation Learning from a Single Temporally Misaligned Video},
  author    = {Huey, William and Wang, Huaxiaoyue and Wu, Anne and Artzi, Yoav and Choudhury, Sanjiban},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {26283--26310},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/huey25a/huey25a.pdf},
  url       = {https://proceedings.mlr.press/v267/huey25a.html},
  abstract  = {We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. The project website is at https://portal-cornell.github.io/orca/}
}
Endnote
%0 Conference Paper
%T Imitation Learning from a Single Temporally Misaligned Video
%A William Huey
%A Huaxiaoyue Wang
%A Anne Wu
%A Yoav Artzi
%A Sanjiban Choudhury
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-huey25a
%I PMLR
%P 26283--26310
%U https://proceedings.mlr.press/v267/huey25a.html
%V 267
%X We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. The project website is at https://portal-cornell.github.io/orca/
APA
Huey, W., Wang, H., Wu, A., Artzi, Y. & Choudhury, S. (2025). Imitation Learning from a Single Temporally Misaligned Video. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:26283-26310. Available from https://proceedings.mlr.press/v267/huey25a.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v267/main/assets/huey25a/huey25a.pdf