More Efficient Off-Policy Evaluation through Regularized Targeted Learning

Aurelien Bibaut, Ivana Malenica, Nikos Vlassis, Mark Van Der Laan
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:654-663, 2019.

Abstract

We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. We also introduce several variance reduction techniques that lead to impressive performance gains in off-policy evaluation. We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. Finally, we further the existing theoretical analysis of estimators for the RL off-policy estimation problem by showing their $O_P(1/\sqrt{n})$ rate of convergence and characterizing their asymptotic distribution.
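As background only (this is not the paper's regularized targeted-learning estimator), the sketch below shows the standard step-wise doubly-robust OPE estimate of Jiang & Li (2016) that this family of doubly-robust methods builds on. The trajectory format and the fitted outcome models q_hat and v_hat are hypothetical placeholders, assumed here purely for illustration.

import numpy as np

def doubly_robust_ope(trajectories, q_hat, v_hat, gamma=1.0):
    """Step-wise doubly-robust OPE estimate (Jiang & Li, 2016 style sketch).

    Each trajectory is a list of dicts with keys:
      's', 'a', 'r'    -- state, action, reward at each step
      'pi_e', 'pi_b'   -- target / behavior action probabilities for the taken action
    q_hat(s, a) and v_hat(s) are fitted estimates of the target policy's
    Q-function and value function (the "outcome model").
    """
    estimates = []
    for traj in trajectories:
        v_dr = 0.0
        # Walk the trajectory backwards, correcting the model prediction
        # with an importance-weighted temporal-difference term.
        for step in reversed(traj):
            rho = step['pi_e'] / step['pi_b']  # per-step importance ratio
            td = step['r'] + gamma * v_dr - q_hat(step['s'], step['a'])
            v_dr = v_hat(step['s']) + rho * td
        estimates.append(v_dr)
    return float(np.mean(estimates))

The estimate is consistent if either the importance ratios or the outcome model is correct, which is the double robustness the abstract refers to; the paper's contribution is a regularized TMLE-based refinement of this idea, not reproduced here.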

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-bibaut19a,
  title     = {More Efficient Off-Policy Evaluation through Regularized Targeted Learning},
  author    = {Bibaut, Aurelien and Malenica, Ivana and Vlassis, Nikos and Van Der Laan, Mark},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {654--663},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/bibaut19a/bibaut19a.pdf},
  url       = {https://proceedings.mlr.press/v97/bibaut19a.html},
  abstract  = {We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. We also introduce several variance reduction techniques that lead to impressive performance gains in off-policy evaluation. We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. Finally, we further the existing theoretical analysis of estimators for the RL off-policy estimation problem by showing their $O_P(1/\sqrt{n})$ rate of convergence and characterizing their asymptotic distribution.}
}
Endnote
%0 Conference Paper
%T More Efficient Off-Policy Evaluation through Regularized Targeted Learning
%A Aurelien Bibaut
%A Ivana Malenica
%A Nikos Vlassis
%A Mark Van Der Laan
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-bibaut19a
%I PMLR
%P 654--663
%U https://proceedings.mlr.press/v97/bibaut19a.html
%V 97
%X We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. We also introduce several variance reduction techniques that lead to impressive performance gains in off-policy evaluation. We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. Finally, we further the existing theoretical analysis of estimators for the RL off-policy estimation problem by showing their $O_P(1/\sqrt{n})$ rate of convergence and characterizing their asymptotic distribution.
APA
Bibaut, A., Malenica, I., Vlassis, N. & Van Der Laan, M. (2019). More Efficient Off-Policy Evaluation through Regularized Targeted Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:654-663. Available from https://proceedings.mlr.press/v97/bibaut19a.html.