P3O: Policy-on Policy-off Policy Optimization

Rasool Fakoor, Pratik Chaudhari, Alexander J. Smola
Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:1017-1027, 2020.

Abstract

On-policy reinforcement learning (RL) algorithms have high sample complexity, while off-policy algorithms are difficult to tune. Merging the two holds the promise of developing efficient algorithms that generalize across diverse environments. In practice, however, it is challenging to find suitable hyper-parameters that govern this trade-off. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can be from each other and does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is effective in reducing the sample complexity of state-of-the-art algorithms. Code to reproduce the experiments in this paper is at https://github.com/rasoolfa/P3O.
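As a rough illustration of the statistic the abstract refers to, the sketch below computes the standard normalized effective sample size (ESS) of the importance weights between a target policy and a behavior policy. The function name and the toy log-probabilities are illustrative only; how P3O actually feeds this quantity into its interleaved on-/off-policy updates is specified in the paper, not here.

import numpy as np

def effective_sample_size(logp_target, logp_behavior):
    """Normalized ESS of importance weights w_i = pi(a_i|s_i) / beta(a_i|s_i).

    ESS = (sum_i w_i)^2 / (n * sum_i w_i^2), which lies in (0, 1]: close to 1
    when the target policy pi and the behavior policy beta agree on the sampled
    actions, and close to 0 when they have drifted far apart.
    """
    log_w = logp_target - logp_behavior      # log importance ratios
    log_w -= log_w.max()                     # ESS is scale-invariant; stabilize exp
    w = np.exp(log_w)
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

# Toy usage: the behavior policy is a slightly older snapshot of the target policy.
rng = np.random.default_rng(0)
logp_behavior = rng.normal(-1.5, 0.3, size=256)
logp_target = logp_behavior + rng.normal(0.0, 0.2, size=256)  # small drift
print(effective_sample_size(logp_target, logp_behavior))      # close to 1.0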

Cite this Paper


BibTeX
@InProceedings{pmlr-v115-fakoor20a,
  title     = {P3O: Policy-on Policy-off Policy Optimization},
  author    = {Fakoor, Rasool and Chaudhari, Pratik and Smola, Alexander J.},
  booktitle = {Proceedings of The 35th Uncertainty in Artificial Intelligence Conference},
  pages     = {1017--1027},
  year      = {2020},
  editor    = {Adams, Ryan P. and Gogate, Vibhav},
  volume    = {115},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--25 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v115/fakoor20a/fakoor20a.pdf},
  url       = {https://proceedings.mlr.press/v115/fakoor20a.html},
  abstract  = {On-policy reinforcement learning (RL) algorithms have high sample complexity while off-policy algorithms are difficult to tune. Merging the two holds the promise to develop efficient algorithms that generalize across diverse environments. It is however challenging in practice to find suitable hyper-parameters that govern this trade off. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can be from each other and does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is effective in reducing the sample complexity of state-of-the-art algorithms. Code to reproduce experiments in this paper is at https://github.com/rasoolfa/P3O.}
}
Endnote
%0 Conference Paper
%T P3O: Policy-on Policy-off Policy Optimization
%A Rasool Fakoor
%A Pratik Chaudhari
%A Alexander J. Smola
%B Proceedings of The 35th Uncertainty in Artificial Intelligence Conference
%C Proceedings of Machine Learning Research
%D 2020
%E Ryan P. Adams
%E Vibhav Gogate
%F pmlr-v115-fakoor20a
%I PMLR
%P 1017--1027
%U https://proceedings.mlr.press/v115/fakoor20a.html
%V 115
%X On-policy reinforcement learning (RL) algorithms have high sample complexity while off-policy algorithms are difficult to tune. Merging the two holds the promise to develop efficient algorithms that generalize across diverse environments. It is however challenging in practice to find suitable hyper-parameters that govern this trade off. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can be from each other and does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is effective in reducing the sample complexity of state-of-the-art algorithms. Code to reproduce experiments in this paper is at https://github.com/rasoolfa/P3O.
APA
Fakoor, R., Chaudhari, P. & Smola, A.J. (2020). P3O: Policy-on Policy-off Policy Optimization. Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, in Proceedings of Machine Learning Research 115:1017-1027. Available from https://proceedings.mlr.press/v115/fakoor20a.html.