The Mirage of Action-Dependent Baselines in Reinforcement Learning

George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, Sergey Levine
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5015-5024, 2018.

Abstract

Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.
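The baseline idea the abstract refers to can be seen in a toy setting. The sketch below (an illustration, not code from the paper) uses a two-armed bandit with a single-parameter softmax policy: the score-function (REINFORCE) gradient estimator is `grad log pi(a) * (R - b)`, and subtracting a baseline `b` leaves it unbiased while shrinking its variance. Here `b` is the expected reward under the current policy, the one-step analogue of a state-dependent value baseline; all names and constants are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3                        # single logit: pi(a=1) = sigmoid(theta)
p1 = 1.0 / (1.0 + np.exp(-theta))  # probability of choosing arm 1
means = np.array([1.0, 2.0])       # mean reward of arm 0 and arm 1

def grad_estimates(baseline, n=200_000):
    """Draw n single-sample REINFORCE estimates of d E[R] / d theta."""
    a = (rng.random(n) < p1).astype(int)   # sample actions from the policy
    r = means[a] + rng.standard_normal(n)  # noisy rewards
    score = a - p1                         # d log pi(a) / d theta for this policy
    return score * (r - baseline)

g_none = grad_estimates(baseline=0.0)
v = (1 - p1) * means[0] + p1 * means[1]    # value baseline: E[R] under the policy
g_base = grad_estimates(baseline=v)

# Both estimators are unbiased for the true gradient p1 * (1 - p1)...
print(g_none.mean(), g_base.mean())
# ...but the baselined estimator has markedly lower variance.
print(g_none.var(), g_base.var())
```

The paper's point is about the further step from `b(s)` to `b(s, a)`: in this decomposition, the extra variance a learned state-action baseline could remove is small in common benchmarks.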

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-tucker18a,
  title     = {The Mirage of Action-Dependent Baselines in Reinforcement Learning},
  author    = {Tucker, George and Bhupatiraju, Surya and Gu, Shixiang and Turner, Richard and Ghahramani, Zoubin and Levine, Sergey},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {5015--5024},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/tucker18a/tucker18a.pdf},
  url       = {http://proceedings.mlr.press/v80/tucker18a.html},
  abstract  = {Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.}
}
Endnote
%0 Conference Paper
%T The Mirage of Action-Dependent Baselines in Reinforcement Learning
%A George Tucker
%A Surya Bhupatiraju
%A Shixiang Gu
%A Richard Turner
%A Zoubin Ghahramani
%A Sergey Levine
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-tucker18a
%I PMLR
%P 5015--5024
%U http://proceedings.mlr.press/v80/tucker18a.html
%V 80
%X Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.
APA
Tucker, G., Bhupatiraju, S., Gu, S., Turner, R., Ghahramani, Z. & Levine, S. (2018). The Mirage of Action-Dependent Baselines in Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5015-5024. Available from http://proceedings.mlr.press/v80/tucker18a.html.