Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation

Nathan Kallus, Masatoshi Uehara
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:5078-5088, 2020.

Abstract

Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness.
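The abstract's description of DRL — combining an estimated $q$-function with an estimated marginalized density ratio so that the estimator stays consistent if either component is correct — can be illustrated with a minimal sketch. This is not the authors' implementation; the trajectory format, and the `q`, `v`, and `mu` nuisance functions passed in, are hypothetical placeholders for nuisances that the paper estimates with cross-fold fitting.

```python
import numpy as np

def drl_estimate(trajectories, q, v, mu, gamma):
    """Hedged sketch of a doubly robust (DRL-style) OPE estimate.

    trajectories: list of trajectories, each a list of (s, a, r, s_next) tuples
    q(s, a): estimated q-function under the evaluation policy (hypothetical)
    v(s): state value, i.e. q averaged over the evaluation policy's actions
    mu(s, a): estimated marginalized state-action density ratio (hypothetical)
    gamma: discount factor
    """
    estimates = []
    for traj in trajectories:
        s0 = traj[0][0]
        # direct-method term: model-based value of the initial state
        total = v(s0)
        for t, (s, a, r, s_next) in enumerate(traj):
            # density-ratio-weighted correction term; it vanishes when the
            # q-function is exact, and it debiases the estimate when it is not
            total += gamma**t * mu(s, a) * (r + gamma * v(s_next) - q(s, a))
        estimates.append(total)
    return float(np.mean(estimates))

# Toy check on a one-state MDP with reward 1 and gamma = 0.5, where the true
# value is 1 / (1 - 0.5) = 2: with an exact q-function the corrections cancel.
trajs = [[(0, 0, 1.0, 0)] * 3]
value = drl_estimate(trajs, q=lambda s, a: 2.0, v=lambda s: 2.0,
                     mu=lambda s, a: 1.0, gamma=0.5)
```

In this toy case every correction term is `1 + 0.5 * 2 - 2 = 0`, so the estimate reduces to `v(s0) = 2.0`; with an inexact `q`, a consistent `mu` would still debias the estimate, which is the double robustness the abstract refers to.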

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-kallus20b,
  title = {Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation},
  author = {Kallus, Nathan and Uehara, Masatoshi},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {5078--5088},
  year = {2020},
  editor = {Hal Daumé III and Aarti Singh},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/kallus20b/kallus20b.pdf},
  url = {http://proceedings.mlr.press/v119/kallus20b.html},
  abstract = {Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness.}
}
Endnote
%0 Conference Paper
%T Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation
%A Nathan Kallus
%A Masatoshi Uehara
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-kallus20b
%I PMLR
%P 5078--5088
%U http://proceedings.mlr.press/v119/kallus20b.html
%V 119
%X Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness.
APA
Kallus, N. & Uehara, M. (2020). Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:5078-5088. Available from http://proceedings.mlr.press/v119/kallus20b.html.