Policy Optimization with Demonstrations

Bingyi Kang; Zequn Jie; Jiashi Feng

Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2469-2478, 2018.

Abstract

Exploration remains a significant challenge to reinforcement learning methods, especially in environments where reward signals are sparse. Recent methods of learning from demonstrations have shown to be promising in overcoming exploration difficulties but typically require considerable high-quality demonstrations that are difficult to collect. We propose to effectively leverage available demonstrations to guide exploration through enforcing occupancy measure matching between the learned policy and current demonstrations, and develop a novel Policy Optimization from Demonstration (POfD) method. We show that POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Furthermore, it can be combined with policy gradient methods to produce state-of-the-art results, as demonstrated experimentally on a range of popular benchmark sparse-reward tasks, even when the demonstrations are few and imperfect.

Cite this Paper

BibTeX

@InProceedings{pmlr-v80-kang18a,
  title = 	 {Policy Optimization with Demonstrations},
  author =       {Kang, Bingyi and Jie, Zequn and Feng, Jiashi},
  booktitle = 	 {Proceedings of the 35th International Conference on Machine Learning},
  pages = 	 {2469--2478},
  year = 	 {2018},
  editor = 	 {Dy, Jennifer and Krause, Andreas},
  volume = 	 {80},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v80/kang18a/kang18a.pdf},
  url = 	 {https://proceedings.mlr.press/v80/kang18a.html},
  abstract = 	 {Exploration remains a significant challenge to reinforcement learning methods, especially in environments where reward signals are sparse. Recent methods of learning from demonstrations have shown to be promising in overcoming exploration difficulties but typically require considerable high-quality demonstrations that are difficult to collect. We propose to effectively leverage available demonstrations to guide exploration through enforcing occupancy measure matching between the learned policy and current demonstrations, and develop a novel Policy Optimization from Demonstration (POfD) method. We show that POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Furthermore, it can be combined with policy gradient methods to produce state-of-the-art results, as demonstrated experimentally on a range of popular benchmark sparse-reward tasks, even when the demonstrations are few and imperfect.}
}

EndNote

%0 Conference Paper
%T Policy Optimization with Demonstrations
%A Bingyi Kang
%A Zequn Jie
%A Jiashi Feng
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-kang18a
%I PMLR
%P 2469--2478
%U https://proceedings.mlr.press/v80/kang18a.html
%V 80
%X Exploration remains a significant challenge to reinforcement learning methods, especially in environments where reward signals are sparse. Recent methods of learning from demonstrations have shown to be promising in overcoming exploration difficulties but typically require considerable high-quality demonstrations that are difficult to collect. We propose to effectively leverage available demonstrations to guide exploration through enforcing occupancy measure matching between the learned policy and current demonstrations, and develop a novel Policy Optimization from Demonstration (POfD) method. We show that POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Furthermore, it can be combined with policy gradient methods to produce state-of-the-art results, as demonstrated experimentally on a range of popular benchmark sparse-reward tasks, even when the demonstrations are few and imperfect.

APA

Kang, B., Jie, Z. & Feng, J. (2018). Policy Optimization with Demonstrations. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:2469-2478. Available from https://proceedings.mlr.press/v80/kang18a.html.
