Guided Exploration with Proximal Policy Optimization using a Single Demonstration

Gabriele Libardi, Gianni De Fabritiis, Sebastian Dittert
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6611-6620, 2021.

Abstract

Solving sparse reward tasks through exploration is one of the major challenges in deep reinforcement learning, especially in three-dimensional, partially observable environments. Critically, the algorithm proposed in this article is capable of using a single human demonstration to solve hard-exploration problems. We train an agent on a combination of demonstrations and its own experience to solve problems with variable initial conditions, and we integrate it with proximal policy optimization (PPO). The agent is also able to increase its performance and to tackle harder problems by replaying its own past trajectories, prioritized by the obtained reward and the maximum value of the trajectory. Finally, we compare variations of this algorithm to different imitation learning algorithms on a set of hard-exploration tasks in the Animal-AI Olympics environment. To the best of our knowledge, learning a task of comparable difficulty in a three-dimensional environment from only one human demonstration has never been considered before.
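The trajectory-replay idea summarized above can be pictured with a short sketch. The Python snippet below is not the authors' implementation; the buffer size, the priority formula (reward plus a weighted maximum value estimate), and all names are assumptions made here purely for illustration. It only shows how past trajectories might be ranked by episodic reward and the maximum critic value observed along them, so the best ones can be replayed alongside fresh PPO rollouts.

import heapq
import itertools
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trajectory:
    observations: List        # per-step observations
    actions: List             # per-step actions
    total_reward: float       # sparse episodic reward obtained at the end
    max_value: float          # highest critic value estimate seen along the trajectory

class PrioritizedTrajectoryBuffer:
    """Keeps the top-capacity trajectories ranked by reward and max value (assumed scheme)."""

    def __init__(self, capacity: int = 16, value_weight: float = 0.1):
        self.capacity = capacity
        self.value_weight = value_weight          # assumed trade-off coefficient
        self._heap: List[Tuple[float, int, Trajectory]] = []
        self._counter = itertools.count()         # tie-breaker; trajectories themselves are never compared

    def priority(self, traj: Trajectory) -> float:
        # Assumed priority: reward dominates; the max value term rewards partial
        # progress on episodes that never reached the sparse goal.
        return traj.total_reward + self.value_weight * traj.max_value

    def add(self, traj: Trajectory) -> None:
        item = (self.priority(traj), next(self._counter), traj)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            heapq.heappushpop(self._heap, item)   # evict the lowest-priority trajectory

    def best(self, n: int = 1) -> List[Trajectory]:
        return [t for _, _, t in heapq.nlargest(n, self._heap, key=lambda x: x[0])]

In the setting described by the paper, a single human demonstration would seed such a buffer and the agent's own higher-priority trajectories would progressively take its place; the exact prioritization and the imitation loss combined with PPO are detailed in the paper itself.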

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-libardi21a,
  title     = {Guided Exploration with Proximal Policy Optimization using a Single Demonstration},
  author    = {Libardi, Gabriele and De Fabritiis, Gianni and Dittert, Sebastian},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {6611--6620},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/libardi21a/libardi21a.pdf},
  url       = {https://proceedings.mlr.press/v139/libardi21a.html},
  abstract  = {Solving sparse reward tasks through exploration is one of the major challenges in deep reinforcement learning, especially in three-dimensional, partially-observable environments. Critically, the algorithm proposed in this article is capable of using a single human demonstration to solve hard-exploration problems. We train an agent on a combination of demonstrations and own experience to solve problems with variable initial conditions and we integrate it with proximal policy optimization (PPO). The agent is also able to increase its performance and to tackle harder problems by replaying its own past trajectories prioritizing them based on the obtained reward and the maximum value of the trajectory. We finally compare variations of this algorithm to different imitation learning algorithms on a set of hard-exploration tasks in the Animal-AI Olympics environment. To the best of our knowledge, learning a task in a three-dimensional environment with comparable difficulty has never been considered before using only one human demonstration.}
}
Endnote
%0 Conference Paper
%T Guided Exploration with Proximal Policy Optimization using a Single Demonstration
%A Gabriele Libardi
%A Gianni De Fabritiis
%A Sebastian Dittert
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-libardi21a
%I PMLR
%P 6611--6620
%U https://proceedings.mlr.press/v139/libardi21a.html
%V 139
%X Solving sparse reward tasks through exploration is one of the major challenges in deep reinforcement learning, especially in three-dimensional, partially-observable environments. Critically, the algorithm proposed in this article is capable of using a single human demonstration to solve hard-exploration problems. We train an agent on a combination of demonstrations and own experience to solve problems with variable initial conditions and we integrate it with proximal policy optimization (PPO). The agent is also able to increase its performance and to tackle harder problems by replaying its own past trajectories prioritizing them based on the obtained reward and the maximum value of the trajectory. We finally compare variations of this algorithm to different imitation learning algorithms on a set of hard-exploration tasks in the Animal-AI Olympics environment. To the best of our knowledge, learning a task in a three-dimensional environment with comparable difficulty has never been considered before using only one human demonstration.
APA
Libardi, G., De Fabritiis, G. & Dittert, S. (2021). Guided Exploration with Proximal Policy Optimization using a Single Demonstration. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6611-6620. Available from https://proceedings.mlr.press/v139/libardi21a.html.
