Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

Ming Yin; Yu-Xiang Wang

Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:3948-3958, 2020.

Abstract

We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon $H$, which motivates a splurge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. The MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.

Cite this Paper

BibTeX


@InProceedings{pmlr-v108-yin20b,
  title = {Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning},
  author = {Yin, Ming and Wang, Yu-Xiang},
  booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics},
  pages = {3948--3958},
  year = {2020},
  editor = {Chiappa, Silvia and Calandra, Roberto},
  volume = {108},
  series = {Proceedings of Machine Learning Research},
  month = {26--28 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v108/yin20b/yin20b.pdf},
  url = {https://proceedings.mlr.press/v108/yin20b.html},
  abstract = {We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon $H$, which motivates a splurge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. The MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.}
}

Endnote

%0 Conference Paper
%T Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning
%A Ming Yin
%A Yu-Xiang Wang
%B Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2020
%E Silvia Chiappa
%E Roberto Calandra
%F pmlr-v108-yin20b
%I PMLR
%P 3948--3958
%U https://proceedings.mlr.press/v108/yin20b.html
%V 108
%X We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon $H$, which motivates a splurge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. The MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.

APA


Yin, M. & Wang, Y. (2020). Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:3948-3958. Available from https://proceedings.mlr.press/v108/yin20b.html.
