POIL: Preference Optimization for Imitation Learning

Kuanyen Liu, Renjyun Huang, Chang Chih Meng, I-Chen Wu
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:558-573, 2025.

Abstract

Imitation learning (IL) enables agents to learn policies by mimicking expert demonstrations. While online IL methods require interaction with the environment, which can be costly, risky, or impractical, offline IL allows agents to learn solely from expert datasets without any environment interaction. In this paper, we propose Preference Optimization for Imitation Learning (POIL), a novel approach inspired by preference optimization techniques in large language model alignment. POIL eliminates the need for adversarial training and reference models by directly comparing the agent’s actions to expert actions using a preference-based loss function. We evaluate POIL on MuJoCo control tasks and Adroit manipulation tasks. Our experiments show that POIL consistently delivers superior or competitive performance against prior state-of-the-art methods, including Behavioral Cloning (BC), IQ-Learn, MCNN, and O-DICE, especially in data-scarce scenarios such as learning from a single trajectory. These results demonstrate that POIL improves data efficiency and stability in offline imitation learning, making it a promising solution for applications where environment interaction is infeasible and expert data is limited, even in high-dimensional and complex control tasks.
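To make the core idea concrete: in DPO-style preference optimization, the "preferred" sample's log-probability is pushed up relative to the "dispreferred" one through a logistic (Bradley–Terry) link. Below is a minimal sketch of what such a reference-free preference loss could look like when the expert action plays the preferred role and the agent's own action the dispreferred one. This is an illustrative assumption, not the paper's actual POIL objective; the function name `preference_loss`, the scale parameter `beta`, and the inputs `logp_expert`/`logp_agent` are hypothetical.

```python
import numpy as np

def preference_loss(logp_expert, logp_agent, beta=1.0):
    """Illustrative DPO-style, reference-free preference loss (NOT the
    exact POIL objective from the paper).

    Treats the expert action as "preferred" and the agent's sampled action
    as "dispreferred": the policy's log-probability margin between them is
    passed through -log(sigmoid(.)), so increasing the margin lowers the loss.
    """
    margin = beta * (np.asarray(logp_expert) - np.asarray(logp_agent))
    # -log sigmoid(margin), computed in a numerically stable form:
    # log(1 + exp(-margin)) via logaddexp(0, -margin)
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

Note there is no reference policy and no discriminator anywhere in the expression, which is consistent with the abstract's claim of avoiding both adversarial training and reference models; the per-state log-probabilities would come from the learned policy itself.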

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-liu25c,
  title = {POIL: Preference Optimization for Imitation Learning},
  author = {Liu, Kuanyen and Huang, Renjyun and Meng, Chang Chih and Wu, I-Chen},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages = {558--573},
  year = {2025},
  editor = {Lee, Hung-yi and Liu, Tongliang},
  volume = {304},
  series = {Proceedings of Machine Learning Research},
  month = {09--12 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/liu25c/liu25c.pdf},
  url = {https://proceedings.mlr.press/v304/liu25c.html},
  abstract = {Imitation learning (IL) enables agents to learn policies by mimicking expert demonstrations. While online IL methods require interaction with the environment, which can be costly, risky, or impractical, offline IL allows agents to learn solely from expert datasets without any environment interaction. In this paper, we propose Preference Optimization for Imitation Learning (POIL), a novel approach inspired by preference optimization techniques in large language model alignment. POIL eliminates the need for adversarial training and reference models by directly comparing the agent’s actions to expert actions using a preference-based loss function. We evaluate POIL on MuJoCo control tasks and Adroit manipulation tasks. Our experiments show that POIL consistently delivers superior or competitive performance against prior state-of-the-art methods, including Behavioral Cloning (BC), IQ-Learn, MCNN, and O-DICE, especially in data-scarce scenarios such as learning from a single trajectory. These results demonstrate that POIL improves data efficiency and stability in offline imitation learning, making it a promising solution for applications where environment interaction is infeasible and expert data is limited, even in high-dimensional and complex control tasks.}
}
Endnote
%0 Conference Paper
%T POIL: Preference Optimization for Imitation Learning
%A Kuanyen Liu
%A Renjyun Huang
%A Chang Chih Meng
%A I-Chen Wu
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-liu25c
%I PMLR
%P 558--573
%U https://proceedings.mlr.press/v304/liu25c.html
%V 304
%X Imitation learning (IL) enables agents to learn policies by mimicking expert demonstrations. While online IL methods require interaction with the environment, which can be costly, risky, or impractical, offline IL allows agents to learn solely from expert datasets without any environment interaction. In this paper, we propose Preference Optimization for Imitation Learning (POIL), a novel approach inspired by preference optimization techniques in large language model alignment. POIL eliminates the need for adversarial training and reference models by directly comparing the agent’s actions to expert actions using a preference-based loss function. We evaluate POIL on MuJoCo control tasks and Adroit manipulation tasks. Our experiments show that POIL consistently delivers superior or competitive performance against prior state-of-the-art methods, including Behavioral Cloning (BC), IQ-Learn, MCNN, and O-DICE, especially in data-scarce scenarios such as learning from a single trajectory. These results demonstrate that POIL improves data efficiency and stability in offline imitation learning, making it a promising solution for applications where environment interaction is infeasible and expert data is limited, even in high-dimensional and complex control tasks.
APA
Liu, K., Huang, R., Meng, C.C. & Wu, I.-C. (2025). POIL: Preference Optimization for Imitation Learning. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:558-573. Available from https://proceedings.mlr.press/v304/liu25c.html.