Truly Proximal Policy Optimization

Yuhui Wang, Hao He, Xiaoyang Tan
Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:113-122, 2020.

Abstract

Proximal policy optimization (PPO) is one of the most successful deep reinforcement learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from fully understood. In this paper, we show that PPO can neither strictly restrict the probability ratio, as it attempts to do, nor enforce a well-defined trust region constraint, which means that it may still suffer from performance instability. To address this issue, we present an enhanced PPO method, named Trust Region-based PPO with Rollback (TR-PPO-RB). Our method makes two critical improvements: 1) it adopts a new clipping function that supports a rollback behavior to restrict the ratio between the new policy and the old one; 2) it replaces the triggering condition for clipping with a trust region-based one, which is theoretically justified by the trust region theorem. By adhering more truly to the “proximal” property, that is, restricting the policy within the trust region, the new algorithm appears to improve on the original PPO in both stability and sample efficiency.
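
The difference between ordinary clipping and a rollback-style clipping function can be illustrated with a minimal per-sample sketch. The first function is the standard PPO clipped surrogate. The second is only an assumed form of a rollback function: the abstract does not give the formula, so the negative slope parameter `alpha`, the default values of `eps` and `alpha`, and both function names are hypothetical choices made for illustration. The trust region-based triggering condition is not sketched here.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage.
    Outside [1 - eps, 1 + eps] the objective becomes flat in `ratio`,
    so the gradient vanishes but nothing pushes the ratio back.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def rollback_objective(ratio, advantage, eps=0.2, alpha=0.3):
    """Rollback-style surrogate (assumed form, for illustration only).

    Outside the clip range the flat segment is replaced by a line with
    slope -alpha, chosen to be continuous at the range boundaries, so
    moving the ratio further away lowers the objective.
    """
    if ratio <= 1.0 - eps:
        f = -alpha * ratio + (1.0 + alpha) * (1.0 - eps)
    elif ratio >= 1.0 + eps:
        f = -alpha * ratio + (1.0 + alpha) * (1.0 + eps)
    else:
        f = ratio
    return min(ratio * advantage, f * advantage)


if __name__ == "__main__":
    # With a positive advantage, compare both objectives as the ratio grows.
    for r in (1.0, 1.2, 1.5, 2.0):
        print(r, ppo_clip_objective(r, 1.0), rollback_objective(r, 1.0))
```

Under these assumptions, the clipped objective stays constant once the ratio leaves the clip range, while the rollback variant keeps decreasing, which gives the optimizer a reason to move the ratio back toward the range.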

Cite this Paper


BibTeX
@InProceedings{pmlr-v115-wang20b,
  title     = {Truly Proximal Policy Optimization},
  author    = {Wang, Yuhui and He, Hao and Tan, Xiaoyang},
  booktitle = {Proceedings of The 35th Uncertainty in Artificial Intelligence Conference},
  pages     = {113--122},
  year      = {2020},
  editor    = {Adams, Ryan P. and Gogate, Vibhav},
  volume    = {115},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--25 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v115/wang20b/wang20b.pdf},
  url       = {https://proceedings.mlr.press/v115/wang20b.html},
  abstract  = {Proximal policy optimization (PPO) is one of the most successful deep reinforcement learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from being fully understood. In this paper, we show that PPO could neither strictly restrict the probability ratio as it attempts to do nor enforce a well-defined trust region constraint, which means that it may still suffer from the risk of performance instability. To address this issue, we present an enhanced PPO method, named Trust Region-based PPO with Rollback (TR-PPO-RB). Two critical improvements are made in our method: 1) it adopts a new clipping function to support a rollback behavior to restrict the ratio between the new policy and the old one; 2) the triggering condition for clipping is replaced with a trust region-based one, which is theoretically justified according to the trust region theorem. It seems, by adhering more truly to the “proximal” property − restricting the policy within the trust region, the new algorithm improves the original PPO on both stability and sample efficiency.}
}
Endnote
%0 Conference Paper
%T Truly Proximal Policy Optimization
%A Yuhui Wang
%A Hao He
%A Xiaoyang Tan
%B Proceedings of The 35th Uncertainty in Artificial Intelligence Conference
%C Proceedings of Machine Learning Research
%D 2020
%E Ryan P. Adams
%E Vibhav Gogate
%F pmlr-v115-wang20b
%I PMLR
%P 113--122
%U https://proceedings.mlr.press/v115/wang20b.html
%V 115
%X Proximal policy optimization (PPO) is one of the most successful deep reinforcement learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from being fully understood. In this paper, we show that PPO could neither strictly restrict the probability ratio as it attempts to do nor enforce a well-defined trust region constraint, which means that it may still suffer from the risk of performance instability. To address this issue, we present an enhanced PPO method, named Trust Region-based PPO with Rollback (TR-PPO-RB). Two critical improvements are made in our method: 1) it adopts a new clipping function to support a rollback behavior to restrict the ratio between the new policy and the old one; 2) the triggering condition for clipping is replaced with a trust region-based one, which is theoretically justified according to the trust region theorem. It seems, by adhering more truly to the “proximal” property − restricting the policy within the trust region, the new algorithm improves the original PPO on both stability and sample efficiency.
APA
Wang, Y., He, H. & Tan, X. (2020). Truly Proximal Policy Optimization. Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, in Proceedings of Machine Learning Research 115:113-122. Available from https://proceedings.mlr.press/v115/wang20b.html.
