Smoothed Action Value Functions for Learning Gaussian Policies

Ofir Nachum; Mohammad Norouzi; George Tucker; Dale Schuurmans

Proceedings of the 35th International Conference on Machine Learning, PMLR 80:3692-3700, 2018.

Abstract

State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
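The link the abstract draws between the covariance gradient and the Hessian of the smoothed Q-value rests on a standard property of Gaussian smoothing: the derivative of a Gaussian expectation with respect to the variance equals half its second derivative with respect to the mean. The sketch below checks that identity numerically in one dimension. It is an illustration only, not the paper's algorithm: the function `q`, the helper names, and the chosen mean and variance are all hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Hypothetical 1-D action value, chosen only for illustration (not from the paper).
def q(a):
    return np.sin(a) - 0.1 * a ** 2

# Gauss-Hermite nodes and weights for integrals weighted by exp(-x^2).
_NODES, _WEIGHTS = hermgauss(60)

def smoothed_q(mean, var):
    """Gaussian-smoothed value E[q(a)] for a ~ N(mean, var), by quadrature."""
    actions = mean + np.sqrt(2.0 * var) * _NODES
    return float(np.sum(_WEIGHTS * q(actions)) / np.sqrt(np.pi))

def check_identity(mean=0.3, var=0.5, h=1e-3):
    """Return (d/dvar of smoothed_q, half of d^2/dmean^2 of smoothed_q)."""
    # Central finite difference in the variance.
    d_var = (smoothed_q(mean, var + h) - smoothed_q(mean, var - h)) / (2.0 * h)
    # Central second difference in the mean.
    d2_mean = (
        smoothed_q(mean + h, var)
        - 2.0 * smoothed_q(mean, var)
        + smoothed_q(mean - h, var)
    ) / h ** 2
    return d_var, 0.5 * d2_mean

if __name__ == "__main__":
    lhs, rhs = check_identity()
    print(f"d/dvar: {lhs:.6f}   0.5 * d2/dmean2: {rhs:.6f}")
```

The two printed numbers should agree up to finite-difference error, which is why a learned smoothed Q-value approximator can in principle supply a covariance update through its Hessian with respect to the action.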

Cite this Paper

BibTeX

@InProceedings{pmlr-v80-nachum18a,
  title = 	 {Smoothed Action Value Functions for Learning {G}aussian Policies},
  author =       {Nachum, Ofir and Norouzi, Mohammad and Tucker, George and Schuurmans, Dale},
  booktitle = 	 {Proceedings of the 35th International Conference on Machine Learning},
  pages = 	 {3692--3700},
  year = 	 {2018},
  editor = 	 {Dy, Jennifer and Krause, Andreas},
  volume = 	 {80},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v80/nachum18a/nachum18a.pdf},
  url = 	 {https://proceedings.mlr.press/v80/nachum18a.html},
  abstract = 	 {State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.}
}

Endnote

%0 Conference Paper
%T Smoothed Action Value Functions for Learning Gaussian Policies
%A Ofir Nachum
%A Mohammad Norouzi
%A George Tucker
%A Dale Schuurmans
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-nachum18a
%I PMLR
%P 3692--3700
%U https://proceedings.mlr.press/v80/nachum18a.html
%V 80
%X State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.

APA

Nachum, O., Norouzi, M., Tucker, G. & Schuurmans, D. (2018). Smoothed Action Value Functions for Learning Gaussian Policies. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:3692-3700. Available from https://proceedings.mlr.press/v80/nachum18a.html.
