Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization

Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:1999-2009, 2021.

Abstract

Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
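Below is a minimal, illustrative sketch (not the authors' code) of a softmax "gradient bandit" learner with a baseline b subtracted from the reward, the kind of setting the abstract refers to. It only shows where the baseline enters the stochastic update theta_i += alpha * (r - b) * (1[i = a] - pi_i); the three-armed Bernoulli bandit, step size, and candidate baselines are assumptions made for illustration.

# Minimal sketch (assumed setup, not the paper's experiments): softmax policy
# over 3 arms, REINFORCE/gradient-bandit update with a baseline b.
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run(baseline, steps=5000, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    true_means = np.array([0.3, 0.5, 0.8])   # Bernoulli arm means (assumed)
    theta = np.zeros(3)                       # softmax policy parameters
    avg_reward = 0.0                          # running average, used if baseline == "mean"

    for t in range(1, steps + 1):
        pi = softmax(theta)
        a = rng.choice(3, p=pi)                  # sample an arm from the policy
        r = float(rng.random() < true_means[a])  # Bernoulli reward

        if baseline == "none":
            b = 0.0
        elif baseline == "mean":                 # running average of observed rewards
            b = avg_reward
        else:                                    # a fixed constant baseline
            b = float(baseline)

        # grad log pi(a) for a softmax policy is one_hot(a) - pi
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta += alpha * (r - b) * grad_log_pi   # baseline enters only here

        avg_reward += (r - avg_reward) / t

    return softmax(theta)

if __name__ == "__main__":
    for b in ["none", "mean", -1.0, 1.0]:
        print(f"baseline={b!r:>6}: final policy {np.round(run(b), 3)}")

Because the expectation of (1[i = a] - pi_i) under the policy is zero, any constant b leaves the expected gradient unchanged; the paper's point is that the choice of b nonetheless shapes the sampled updates (beyond their variance) and can determine whether learning converges.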

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-chung21a,
  title     = {Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization},
  author    = {Chung, Wesley and Thomas, Valentin and Machado, Marlos C. and Roux, Nicolas Le},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {1999--2009},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/chung21a/chung21a.pdf},
  url       = {https://proceedings.mlr.press/v139/chung21a.html},
  abstract  = {Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.}
}
Endnote
%0 Conference Paper
%T Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
%A Wesley Chung
%A Valentin Thomas
%A Marlos C. Machado
%A Nicolas Le Roux
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-chung21a
%I PMLR
%P 1999--2009
%U https://proceedings.mlr.press/v139/chung21a.html
%V 139
%X Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that the standard view is too limited for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
APA
Chung, W., Thomas, V., Machado, M.C. & Roux, N.L. (2021). Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:1999-2009. Available from https://proceedings.mlr.press/v139/chung21a.html.