Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

Botao Hao; Xiang Ji; Yaqi Duan; Hao Lu; Csaba Szepesvari; Mengdi Wang

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4074-4084, 2021.

Abstract

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational burden of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.
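The bootstrapping idea can be illustrated with a minimal sketch, assuming a finite-horizon tabular setting (one-hot features, a special case of linear FQE). It resamples whole episodes with replacement, reruns FQE on each resample, and reads a confidence interval off the bootstrap distribution of the estimation error. The function names, the ridge parameter `lam`, the synthetic two-state MDP, and the `subsample` option (an m-out-of-n bootstrap whose deviations are rescaled by sqrt(m/n)) are illustrative assumptions, not the paper's code; the paper's exact subsampling procedure may differ.

```python
import numpy as np


def fqe_estimate(episodes, pi, n_states, n_actions, horizon, lam=1e-3):
    """Finite-horizon FQE with one-hot (tabular) features and ridge regression.

    episodes: array (n, horizon, 4), each row (s, a, r, s_next).
    pi: array (n_states, n_actions) of target-policy probabilities.
    Returns the value estimate averaged over the episodes' initial states.
    """
    d = n_states * n_actions
    w = np.zeros(d)  # weights of Q_{h+1}; Q_H = 0 at the horizon
    for h in reversed(range(horizon)):
        s, a, s_next = (episodes[:, h, j].astype(int) for j in (0, 1, 3))
        r = episodes[:, h, 2]
        q_next = w.reshape(n_states, n_actions)
        # Regression target: r + sum_a' pi(a'|s') Q_{h+1}(s', a')
        y = r + np.sum(pi[s_next] * q_next[s_next], axis=1)
        phi = np.zeros((len(s), d))
        phi[np.arange(len(s)), s * n_actions + a] = 1.0  # one-hot phi(s, a)
        w = np.linalg.solve(phi.T @ phi + lam * np.eye(d), phi.T @ y)
    q0 = w.reshape(n_states, n_actions)
    s0 = episodes[:, 0, 0].astype(int)
    return float(np.mean(np.sum(pi[s0] * q0[s0], axis=1)))


def bootstrap_fqe_ci(episodes, pi, n_states, n_actions, horizon,
                     n_boot=200, alpha=0.1, subsample=None, seed=0):
    """Basic bootstrap CI for the FQE estimate, resampling whole episodes.

    subsample=m < n gives an m-out-of-n bootstrap (hypothetical option);
    deviations are rescaled by sqrt(m / n), an assumed standard heuristic.
    """
    rng = np.random.default_rng(seed)
    n = len(episodes)
    m = n if subsample is None else subsample
    v_hat = fqe_estimate(episodes, pi, n_states, n_actions, horizon)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=m)  # episode indices, with replacement
        v_b = fqe_estimate(episodes[idx], pi, n_states, n_actions, horizon)
        deltas[b] = np.sqrt(m / n) * (v_b - v_hat)
    q_lo, q_hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return v_hat, (v_hat - q_hi, v_hat - q_lo)


# Hypothetical demo: 2 states, 2 actions, uniform behavior policy,
# reward ~1 when a == s; the target policy always picks a = s, so its value is ~horizon.
rng = np.random.default_rng(1)
n_states, n_actions, horizon, n = 2, 2, 3, 300
episodes = np.zeros((n, horizon, 4))
for i in range(n):
    s = rng.integers(n_states)
    for h in range(horizon):
        a = rng.integers(n_actions)
        r = float(a == s) + 0.1 * rng.standard_normal()
        s_next = rng.integers(n_states)
        episodes[i, h] = (s, a, r, s_next)
        s = s_next
pi = np.eye(n_states)  # deterministic target policy: action = state
v_hat, (lo, hi) = bootstrap_fqe_ci(episodes, pi, n_states, n_actions, horizon)
print(f"FQE estimate {v_hat:.3f}, 90% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling whole episodes rather than individual transitions keeps within-episode dependence intact. Passing a `subsample` smaller than the number of episodes makes each FQE refit cheaper, which is the kind of runtime saving a subsampling procedure aims for.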

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-hao21b,
  title     = {Bootstrapping Fitted Q-Evaluation for Off-Policy Inference},
  author    = {Hao, Botao and Ji, Xiang and Duan, Yaqi and Lu, Hao and Szepesvari, Csaba and Wang, Mengdi},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {4074--4084},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/hao21b/hao21b.pdf},
  url       = {https://proceedings.mlr.press/v139/hao21b.html},
  abstract  = {Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational burden of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.}
}

Endnote

%0 Conference Paper
%T Bootstrapping Fitted Q-Evaluation for Off-Policy Inference
%A Botao Hao
%A Xiang Ji
%A Yaqi Duan
%A Hao Lu
%A Csaba Szepesvari
%A Mengdi Wang
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-hao21b
%I PMLR
%P 4074--4084
%U https://proceedings.mlr.press/v139/hao21b.html
%V 139
%X Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational burden of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

APA

Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvari, C., & Wang, M. (2021). Bootstrapping Fitted Q-Evaluation for Off-Policy Inference. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4074-4084. Available from https://proceedings.mlr.press/v139/hao21b.html.
