Stable Policy Optimization via Off-Policy Divergence Regularization

Ahmed Touati; Amy Zhang; Joelle Pineau; Pascal Vincent

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR 124:1328-1337, 2020.

Abstract

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.

Cite this Paper

BibTeX

@InProceedings{pmlr-v124-touati20a,
  title = 	 {Stable Policy Optimization via Off-Policy Divergence Regularization},
  author =       {Touati, Ahmed and Zhang, Amy and Pineau, Joelle and Vincent, Pascal},
  booktitle = 	 {Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)},
  pages = 	 {1328--1337},
  year = 	 {2020},
  editor = 	 {Peters, Jonas and Sontag, David},
  volume = 	 {124},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {03--06 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v124/touati20a/touati20a.pdf},
  url = 	 {https://proceedings.mlr.press/v124/touati20a.html},
  abstract = 	 {Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.}
}

Endnote

%0 Conference Paper
%T Stable Policy Optimization via Off-Policy Divergence Regularization
%A Ahmed Touati
%A Amy Zhang
%A Joelle Pineau
%A Pascal Vincent
%B Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)
%C Proceedings of Machine Learning Research
%D 2020
%E Jonas Peters
%E David Sontag
%F pmlr-v124-touati20a
%I PMLR
%P 1328--1337
%U https://proceedings.mlr.press/v124/touati20a.html
%V 124
%X Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a wide range of challenging tasks, there is room for improvement in the stabilization of the policy learning and how the off-policy data are used. In this paper we revisit the theoretical foundations of these algorithms and propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. This proximity term, expressed in terms of the divergence between the visitation distributions, is learned in an off-policy and adversarial manner. We empirically show that our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.

APA

Touati, A., Zhang, A., Pineau, J. & Vincent, P. (2020). Stable Policy Optimization via Off-Policy Divergence Regularization. Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), in Proceedings of Machine Learning Research 124:1328-1337. Available from https://proceedings.mlr.press/v124/touati20a.html.
