Smoothed Action Value Functions for Learning Gaussian Policies

Ofir Nachum, Mohammad Norouzi, George Tucker, Dale Schuurmans
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:3692-3700, 2018.

Abstract

State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
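
The core relationships described above can be summarized compactly. As a hedged sketch (the notation below is illustrative, assuming a Gaussian policy with state-dependent mean \mu(s) and covariance \Sigma(s), and is not copied verbatim from the paper), the smoothed Q-value and the gradient identities take roughly the form

\tilde{Q}^\pi(s, a) = \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a, \Sigma(s))}\left[ Q^\pi(s, \tilde{a}) \right],

\nabla_{\mu} \, \mathbb{E}_{a \sim \mathcal{N}(\mu(s), \Sigma(s))}\left[ Q^\pi(s, a) \right] = \nabla_{a} \tilde{Q}^\pi(s, a) \Big|_{a = \mu(s)},

\nabla_{\Sigma} \, \mathbb{E}_{a \sim \mathcal{N}(\mu(s), \Sigma(s))}\left[ Q^\pi(s, a) \right] = \tfrac{1}{2} \nabla^2_{a} \tilde{Q}^\pi(s, a) \Big|_{a = \mu(s)}.

Read this way, the mean of the policy can be updated using the gradient of a learned smoothed Q-value approximator and the covariance using half its Hessian, which is one way to interpret the abstract's claim that policy gradients are "recovered from the gradient and Hessian of the smoothed Q-value function."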

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-nachum18a,
  title     = {Smoothed Action Value Functions for Learning {G}aussian Policies},
  author    = {Nachum, Ofir and Norouzi, Mohammad and Tucker, George and Schuurmans, Dale},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {3692--3700},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/nachum18a/nachum18a.pdf},
  url       = {https://proceedings.mlr.press/v80/nachum18a.html},
  abstract  = {State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.}
}
Endnote
%0 Conference Paper
%T Smoothed Action Value Functions for Learning Gaussian Policies
%A Ofir Nachum
%A Mohammad Norouzi
%A George Tucker
%A Dale Schuurmans
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-nachum18a
%I PMLR
%P 3692--3700
%U https://proceedings.mlr.press/v80/nachum18a.html
%V 80
%X State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
APA
Nachum, O., Norouzi, M., Tucker, G. & Schuurmans, D. (2018). Smoothed Action Value Functions for Learning Gaussian Policies. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:3692-3700. Available from https://proceedings.mlr.press/v80/nachum18a.html.
