Smoothed Action Value Functions for Learning Gaussian Policies
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:3692-3700, 2018.
Abstract
State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on the KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
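The relationships the abstract summarizes can be sketched as follows (the notation here is illustrative and may differ from the paper's): for a Gaussian policy \(\pi(a \mid s) = \mathcal{N}(\mu(s), \Sigma(s))\), the smoothed Q-value is the expected Q-value under Gaussian perturbation of the action, and the policy gradients of the expected reward \(J\) are recovered from its first and second derivatives.

```latex
% Smoothed action value: expectation of Q under Gaussian noise around the action
\tilde{Q}^{\pi}(s, a) \;=\; \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a,\, \Sigma(s))}\!\left[ Q^{\pi}(s, \tilde{a}) \right]

% Gradients of expected reward w.r.t. the Gaussian policy's mean and covariance,
% recovered from the gradient and Hessian of the smoothed Q-value at the mean
\nabla_{\mu} J \;\propto\; \nabla_{a}\, \tilde{Q}^{\pi}(s, a) \Big|_{a = \mu(s)},
\qquad
\nabla_{\Sigma} J \;\propto\; \tfrac{1}{2}\, \nabla^{2}_{a}\, \tilde{Q}^{\pi}(s, a) \Big|_{a = \mu(s)}
```

Intuitively, the mean moves in the direction in which the smoothed value increases, while the covariance adapts to the local curvature of the smoothed value surface.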