Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods

Ching-An Cheng, Xinyan Yan, Byron Boots
Proceedings of the Conference on Robot Learning, PMLR 100:1379-1394, 2020.

Abstract

Policy gradient methods have demonstrated success in reinforcement learning tasks with high-dimensional continuous state and action spaces. But they are also notoriously sample inefficient, which can be attributed, at least in part, to the high variance in estimating the gradient of the task objective with Monte Carlo methods. Previous research has endeavored to contend with this problem by studying control variates (CVs) that can reduce the variance of estimates without introducing bias, including the early use of baselines, state-dependent CVs, and the more recent state-action-dependent CVs. In this work, we analyze the properties and drawbacks of previous CV techniques and, surprisingly, we find that these works have overlooked an important fact: Monte Carlo gradient estimates are generated by trajectories of states and actions. We show that ignoring the correlation across the trajectories can result in suboptimal variance reduction, and we propose a simple fix: a class of trajectory-wise CVs that can further drive down the variance. The trajectory-wise CVs can be computed recursively and, like the previous CVs for policy gradient, require only learning state-action value functions. We further prove that the proposed trajectory-wise CVs are optimal for variance reduction under reasonable assumptions.
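
As a rough illustration of the baseline idea the abstract alludes to (not the paper's trajectory-wise construction), the sketch below compares a plain score-function policy gradient estimator with one that subtracts a constant baseline as a control variate: the estimate stays unbiased while its variance drops. The one-step Gaussian-policy toy problem, the reward function, and all variable names are illustrative assumptions, not anything taken from the paper.

    # Minimal sketch of a score-function (policy gradient) estimator with a
    # constant baseline used as a control variate. Subtracting a term with
    # known zero expectation reduces variance without introducing bias.
    # The toy one-step bandit setup below is an illustrative assumption.
    import numpy as np

    rng = np.random.default_rng(0)

    theta, sigma = 0.5, 1.0           # Gaussian policy: a ~ N(theta, sigma^2)
    reward = lambda a: -(a - 2.0)**2  # toy reward; true gradient is 2*(2 - theta)

    def grad_estimates(n, baseline):
        a = rng.normal(theta, sigma, size=n)
        score = (a - theta) / sigma**2         # d/dtheta log pi(a)
        return score * (reward(a) - baseline)  # unbiased for any constant baseline

    n = 100_000
    plain = grad_estimates(n, baseline=0.0)
    # Use the (approximate) expected reward under the policy as the baseline.
    b = reward(rng.normal(theta, sigma, size=n)).mean()
    with_cv = grad_estimates(n, baseline=b)

    print("true gradient :", 2 * (2.0 - theta))
    print("no baseline   : mean %.3f, var %.1f" % (plain.mean(), plain.var()))
    print("with baseline : mean %.3f, var %.1f" % (with_cv.mean(), with_cv.var()))

Both estimators have the same mean, but the baseline version has markedly lower variance. The paper goes further by constructing CVs that exploit correlations along an entire trajectory; this constant-baseline sketch only shows why subtracting a zero-mean correction leaves the estimate unbiased.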

Cite this Paper


BibTeX
@InProceedings{pmlr-v100-cheng20a,
  title     = {Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods},
  author    = {Cheng, Ching-An and Yan, Xinyan and Boots, Byron},
  booktitle = {Proceedings of the Conference on Robot Learning},
  pages     = {1379--1394},
  year      = {2020},
  editor    = {Kaelbling, Leslie Pack and Kragic, Danica and Sugiura, Komei},
  volume    = {100},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Oct--01 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v100/cheng20a/cheng20a.pdf},
  url       = {https://proceedings.mlr.press/v100/cheng20a.html},
  abstract  = {Policy gradient methods have demonstrated success in reinforcement learning tasks with high-dimensional continuous state and action spaces. But they are also notoriously sample inefficient, which can be attributed, at least in part, to the high variance in estimating the gradient of the task objective with Monte Carlo methods. Previous research has endeavored to contend with this problem by studying control variates (CVs) that can reduce the variance of estimates without introducing bias, including the early use of baselines, state dependent CVs, and the more recent state-action dependent CVs. In this work, we analyze the properties and drawbacks of previous CV techniques and, surprisingly, we find that these works have overlooked an important fact that Monte Carlo gradient estimates are generated by trajectories of states and actions. We show that ignoring the correlation across the trajectories can result in suboptimal variance reduction, and we propose a simple fix: a class of trajectory-wise CVs, that can further drive down the variance. The trajectory-wise CVs can be computed recursively and require only learning state-action value functions like the previous CVs for policy gradient. We further prove that the proposed trajectory-wise CVs are optimal for variance reduction under reasonable assumptions.}
}
Endnote
%0 Conference Paper
%T Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods
%A Ching-An Cheng
%A Xinyan Yan
%A Byron Boots
%B Proceedings of the Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Leslie Pack Kaelbling
%E Danica Kragic
%E Komei Sugiura
%F pmlr-v100-cheng20a
%I PMLR
%P 1379--1394
%U https://proceedings.mlr.press/v100/cheng20a.html
%V 100
%X Policy gradient methods have demonstrated success in reinforcement learning tasks with high-dimensional continuous state and action spaces. But they are also notoriously sample inefficient, which can be attributed, at least in part, to the high variance in estimating the gradient of the task objective with Monte Carlo methods. Previous research has endeavored to contend with this problem by studying control variates (CVs) that can reduce the variance of estimates without introducing bias, including the early use of baselines, state dependent CVs, and the more recent state-action dependent CVs. In this work, we analyze the properties and drawbacks of previous CV techniques and, surprisingly, we find that these works have overlooked an important fact that Monte Carlo gradient estimates are generated by trajectories of states and actions. We show that ignoring the correlation across the trajectories can result in suboptimal variance reduction, and we propose a simple fix: a class of trajectory-wise CVs, that can further drive down the variance. The trajectory-wise CVs can be computed recursively and require only learning state-action value functions like the previous CVs for policy gradient. We further prove that the proposed trajectory-wise CVs are optimal for variance reduction under reasonable assumptions.
APA
Cheng, C., Yan, X., & Boots, B. (2020). Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods. Proceedings of the Conference on Robot Learning, in Proceedings of Machine Learning Research 100:1379-1394. Available from https://proceedings.mlr.press/v100/cheng20a.html.