Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

Botao Hao, Xiang Ji, Yaqi Duan, Hao Lu, Csaba Szepesvari, Mengdi Wang
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4074-4084, 2021.

Abstract

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.
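
To make the procedure concrete, below is a minimal sketch (not the authors' released code) of bootstrapped FQE with linear function approximation: transitions are resampled with replacement in each replicate (optionally as a smaller subsample, mirroring the subsampled bootstrap mentioned above), FQE is refit on each resample, and the resulting value estimates yield a bootstrap confidence interval. The function names, feature maps, hyperparameters, and synthetic data are illustrative assumptions, not the paper's implementation.

```python
# Sketch of bootstrapped fitted Q-evaluation (FQE) with linear features.
import numpy as np

def fqe_linear(phi_sa, rewards, phi_next, gamma=0.95, n_iters=100, ridge=1e-3):
    """Fitted Q-evaluation: iterate ridge regressions onto the Bellman target.

    phi_sa   : (n, d) features of observed (state, action) pairs
    rewards  : (n,)   observed rewards
    phi_next : (n, d) features of (next state, target-policy action)
    Returns the weight vector w with Q(s, a) ~= phi(s, a) @ w.
    """
    n, d = phi_sa.shape
    w = np.zeros(d)
    # The Gram matrix is fixed across iterations; only the regression target changes.
    gram = phi_sa.T @ phi_sa + ridge * np.eye(d)
    for _ in range(n_iters):
        targets = rewards + gamma * phi_next @ w
        w = np.linalg.solve(gram, phi_sa.T @ targets)
    return w

def bootstrap_fqe(phi_sa, rewards, phi_next, phi_init, gamma=0.95,
                  n_boot=200, subsample=None, rng=None):
    """Resample transitions, rerun FQE, and collect bootstrap value estimates.

    phi_init  : (m, d) features of initial (state, target-policy action) pairs;
                their mean Q-value is the policy value being evaluated.
    subsample : if set, draw that many transitions per replicate (subsampled
                bootstrap) instead of a full-size resample.
    """
    rng = np.random.default_rng(rng)
    n = phi_sa.shape[0]
    size = subsample or n
    values = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(n, size=size, replace=True)
        w_b = fqe_linear(phi_sa[idx], rewards[idx], phi_next[idx], gamma)
        values[b] = (phi_init @ w_b).mean()
    return values

# Toy usage with random features: a 95% percentile confidence interval.
rng = np.random.default_rng(0)
n, d = 500, 8
phi_sa, phi_next, phi_init = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(50, d))
rewards = rng.normal(size=n)
vals = bootstrap_fqe(phi_sa, rewards, phi_next, phi_init, n_boot=100, subsample=200, rng=1)
print("bootstrap 95% CI:", np.percentile(vals, [2.5, 97.5]))
```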

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-hao21b,
  title     = {Bootstrapping Fitted Q-Evaluation for Off-Policy Inference},
  author    = {Hao, Botao and Ji, Xiang and Duan, Yaqi and Lu, Hao and Szepesvari, Csaba and Wang, Mengdi},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {4074--4084},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/hao21b/hao21b.pdf},
  url       = {https://proceedings.mlr.press/v139/hao21b.html},
  abstract  = {Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.}
}
Endnote
%0 Conference Paper
%T Bootstrapping Fitted Q-Evaluation for Off-Policy Inference
%A Botao Hao
%A Xiang Ji
%A Yaqi Duan
%A Hao Lu
%A Csaba Szepesvari
%A Mengdi Wang
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-hao21b
%I PMLR
%P 4074--4084
%U https://proceedings.mlr.press/v139/hao21b.html
%V 139
%X Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.
APA
Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvari, C. & Wang, M. (2021). Bootstrapping Fitted Q-Evaluation for Off-Policy Inference. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4074-4084. Available from https://proceedings.mlr.press/v139/hao21b.html.
