Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Yaqi Duan, Zeyu Jia, Mengdi Wang
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:2701-2709, 2020.

Abstract

This paper studies the statistical theory of off-policy evaluation with function approximation in batch-data reinforcement learning problems. We consider a regression-based fitted Q-iteration method, show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator, and prove that this method is information-theoretically optimal, with nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample, instance-dependent error upper bound and a nearly matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence, over the function class, between the long-term distribution of the target policy and the distribution of the past data. This restricted $\chi^2$-divergence characterizes the statistical limit of off-policy evaluation and is both instance-dependent and function-class-dependent. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
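To make the algorithmic setting concrete, below is a minimal sketch of regression-based fitted Q-iteration with linear function approximation for off-policy evaluation, the type of method the abstract refers to. It is an illustration only, not the authors' implementation: the feature map `phi`, the (assumed deterministic) target policy `pi`, the batch of transitions, and all parameter names are hypothetical placeholders, and the ridge regularizer `reg` is added purely for numerical stability.

```python
# Sketch: off-policy evaluation via fitted Q-iteration with linear features.
# Each iteration solves a least-squares regression onto the Bellman backup
# r + gamma * Q_w(s', pi(s')) computed with the previous weight vector.
import numpy as np

def fqi_linear_ope(transitions, phi, pi, d, gamma=0.99, num_iters=100, reg=1e-6):
    """Estimate Q^pi from batch transitions (s, a, r, s').

    transitions: list of (s, a, r, s_next) tuples collected by a behavior policy.
    phi:         feature map, phi(s, a) -> np.ndarray of shape (d,).
    pi:          target policy, pi(s) -> action (assumed deterministic here).
    """
    X = np.stack([phi(s, a) for (s, a, r, s_next) in transitions])                  # (n, d)
    R = np.array([r for (s, a, r, s_next) in transitions])                          # (n,)
    Xp = np.stack([phi(s_next, pi(s_next)) for (s, a, r, s_next) in transitions])   # (n, d)

    # Regularized normal equations: each iteration solves (X^T X + reg I) w = X^T y.
    A = X.T @ X + reg * np.eye(d)

    w = np.zeros(d)
    for _ in range(num_iters):
        y = R + gamma * (Xp @ w)          # regression target (Bellman backup)
        w = np.linalg.solve(A, X.T @ y)   # least-squares fit of Q_w to the backup
    return w                              # Q_hat(s, a) ≈ phi(s, a) @ w
```

Given the fitted weights, the target policy's value under an initial state distribution would be estimated as the average of `phi(s0, pi(s0)) @ w` over sampled initial states `s0`.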

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-duan20b,
  title     = {Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation},
  author    = {Duan, Yaqi and Jia, Zeyu and Wang, Mengdi},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {2701--2709},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/duan20b/duan20b.pdf},
  url       = {https://proceedings.mlr.press/v119/duan20b.html}
}
Endnote
%0 Conference Paper
%T Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
%A Yaqi Duan
%A Zeyu Jia
%A Mengdi Wang
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-duan20b
%I PMLR
%P 2701--2709
%U https://proceedings.mlr.press/v119/duan20b.html
%V 119
APA
Duan, Y., Jia, Z. & Wang, M. (2020). Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:2701-2709. Available from https://proceedings.mlr.press/v119/duan20b.html.
