From Importance Sampling to Doubly Robust Policy Gradient

Jiawei Huang, Nan Jiang
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4434-4443, 2020.

Abstract

We show that on-policy policy gradient (PG) and its variance reduction variants can be derived by taking the finite difference of function evaluations supplied by estimators from the importance sampling (IS) family for off-policy evaluation (OPE). Starting from the doubly robust (DR) estimator (Jiang & Li, 2016), we provide a simple derivation of a very general and flexible form of PG, which subsumes the state-of-the-art variance reduction technique (Cheng et al., 2019) as its special case and immediately hints at further variance reduction opportunities overlooked by existing literature. We analyze the variance of the new DR-PG estimator, compare it to existing methods as well as the Cramér-Rao lower bound of policy gradient, and empirically show its effectiveness.
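The abstract's central observation can be illustrated on a toy problem. The sketch below (illustrative only; all names and the bandit setup are my own, not the paper's code) uses a two-armed softmax bandit: the IS estimate of J(π_θ') from data collected under π_θ, differentiated at θ' = θ via a central finite difference, recovers the on-policy REINFORCE gradient estimate on the same data.

```python
import numpy as np

# Illustrative sketch (not from the paper): a two-armed softmax bandit.
# We show that the finite difference of the importance-sampling (IS)
# OPE estimator, evaluated at the behavior policy, matches the
# on-policy REINFORCE gradient estimate computed on the same data.

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.2])  # expected reward of each arm (deterministic here)

def pi(theta):
    """Softmax policy probabilities; one logit per arm."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.3, -0.1])
n = 200_000
# Collect on-policy data from pi_theta.
actions = rng.choice(2, size=n, p=pi(theta))
r = rewards[actions]

def is_estimate(theta_eval):
    """IS estimate of J(pi_{theta_eval}) using data sampled from pi_theta."""
    w = pi(theta_eval)[actions] / pi(theta)[actions]
    return np.mean(w * r)

# Central finite difference of the IS estimator along the first coordinate.
eps = 1e-4
e0 = np.array([1.0, 0.0])
fd_grad = (is_estimate(theta + eps * e0) - is_estimate(theta - eps * e0)) / (2 * eps)

# REINFORCE estimate of the same coordinate on the same data:
# d/dtheta_0 log pi(a) = 1{a = 0} - pi_0  for a softmax policy.
grad_log_pi0 = (actions == 0).astype(float) - pi(theta)[0]
reinforce_grad = np.mean(grad_log_pi0 * r)

print(abs(fd_grad - reinforce_grad))  # agreement up to O(eps^2) finite-difference error
```

The two estimates agree sample-by-sample (not just in expectation) because the derivative of the IS weight at the behavior policy is exactly the score function; the paper's contribution is to carry this same finite-difference argument through the doubly robust estimator, which yields the more general DR-PG family.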

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-huang20b,
  title     = {From Importance Sampling to Doubly Robust Policy Gradient},
  author    = {Huang, Jiawei and Jiang, Nan},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {4434--4443},
  year      = {2020},
  editor    = {Hal Daumé III and Aarti Singh},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/huang20b/huang20b.pdf},
  url       = {http://proceedings.mlr.press/v119/huang20b.html},
  abstract  = {We show that on-policy policy gradient (PG) and its variance reduction variants can be derived by taking the finite difference of function evaluations supplied by estimators from the importance sampling (IS) family for off-policy evaluation (OPE). Starting from the doubly robust (DR) estimator (Jiang & Li, 2016), we provide a simple derivation of a very general and flexible form of PG, which subsumes the state-of-the-art variance reduction technique (Cheng et al., 2019) as its special case and immediately hints at further variance reduction opportunities overlooked by existing literature. We analyze the variance of the new DR-PG estimator, compare it to existing methods as well as the Cramér-Rao lower bound of policy gradient, and empirically show its effectiveness.}
}
Endnote
%0 Conference Paper
%T From Importance Sampling to Doubly Robust Policy Gradient
%A Jiawei Huang
%A Nan Jiang
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-huang20b
%I PMLR
%P 4434--4443
%U http://proceedings.mlr.press/v119/huang20b.html
%V 119
%X We show that on-policy policy gradient (PG) and its variance reduction variants can be derived by taking the finite difference of function evaluations supplied by estimators from the importance sampling (IS) family for off-policy evaluation (OPE). Starting from the doubly robust (DR) estimator (Jiang & Li, 2016), we provide a simple derivation of a very general and flexible form of PG, which subsumes the state-of-the-art variance reduction technique (Cheng et al., 2019) as its special case and immediately hints at further variance reduction opportunities overlooked by existing literature. We analyze the variance of the new DR-PG estimator, compare it to existing methods as well as the Cramér-Rao lower bound of policy gradient, and empirically show its effectiveness.
APA
Huang, J. & Jiang, N. (2020). From Importance Sampling to Doubly Robust Policy Gradient. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:4434-4443. Available from http://proceedings.mlr.press/v119/huang20b.html.

Related Material