Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution

Po-Wei Chou; Daniel Maturana; Sebastian Scherer

Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution

Po-Wei Chou, Daniel Maturana, Sebastian Scherer

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:834-843, 2017.

Abstract

Recently, reinforcement learning with deep neural networks has achieved great success in challenging continuous control problems such as 3D locomotion and robotic manipulation. However, in real-world control problems, the actions one can take are bounded by physical constraints, which introduces a bias when the standard Gaussian distribution is used as the stochastic policy. In this work, we propose to use the Beta distribution as an alternative and analyze the bias and variance of the policy gradients of both policies. We show that the Beta policy is bias-free and provides significantly faster convergence and higher scores over the Gaussian policy when both are used with trust region policy optimization (TRPO) and actor critic with experience replay (ACER), the state-of-the-art on- and off-policy stochastic methods respectively, on OpenAI Gym’s and MuJoCo’s continuous control environments.

Cite this Paper

BibTeX


@InProceedings{pmlr-v70-chou17a,
  title = 	 {Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution},
  author =       {Po-Wei Chou and Daniel Maturana and Sebastian Scherer},
  booktitle = 	 {Proceedings of the 34th International Conference on Machine Learning},
  pages = 	 {834--843},
  year = 	 {2017},
  editor = 	 {Precup, Doina and Teh, Yee Whye},
  volume = 	 {70},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--11 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v70/chou17a/chou17a.pdf},
  url = 	 {https://proceedings.mlr.press/v70/chou17a.html},
  abstract = 	 {Recently, reinforcement learning with deep neural networks has achieved great success in challenging continuous control problems such as 3D locomotion and robotic manipulation. However, in real-world control problems, the actions one can take are bounded by physical constraints, which introduces a bias when the standard Gaussian distribution is used as the stochastic policy. In this work, we propose to use the Beta distribution as an alternative and analyze the bias and variance of the policy gradients of both policies. We show that the Beta policy is bias-free and provides significantly faster convergence and higher scores over the Gaussian policy when both are used with trust region policy optimization (TRPO) and actor critic with experience replay (ACER), the state-of-the-art on- and off-policy stochastic methods respectively, on OpenAI Gym’s and MuJoCo’s continuous control environments.}
}

Endnote

%0 Conference Paper
%T Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution
%A Po-Wei Chou
%A Daniel Maturana
%A Sebastian Scherer
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh	
%F pmlr-v70-chou17a
%I PMLR
%P 834--843
%U https://proceedings.mlr.press/v70/chou17a.html
%V 70
%X Recently, reinforcement learning with deep neural networks has achieved great success in challenging continuous control problems such as 3D locomotion and robotic manipulation. However, in real-world control problems, the actions one can take are bounded by physical constraints, which introduces a bias when the standard Gaussian distribution is used as the stochastic policy. In this work, we propose to use the Beta distribution as an alternative and analyze the bias and variance of the policy gradients of both policies. We show that the Beta policy is bias-free and provides significantly faster convergence and higher scores over the Gaussian policy when both are used with trust region policy optimization (TRPO) and actor critic with experience replay (ACER), the state-of-the-art on- and off-policy stochastic methods respectively, on OpenAI Gym’s and MuJoCo’s continuous control environments.

APA


Chou, P., Maturana, D. & Scherer, S.. (2017). Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:834-843 Available from https://proceedings.mlr.press/v70/chou17a.html.

Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution

Abstract

Cite this Paper

Related Material