Toward Minimax Off-policy Value Estimation
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR 38:608-616, 2015.
Abstract
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. It is shown that, in a large class of settings, the so-called regression estimator is minimax optimal up to a constant that depends on the number of actions, while the other two can be arbitrarily worse even in the limit of infinitely many data points, despite their empirical success and popularity. The performance of these estimators is studied in synthetic and real problems, illustrating the nontriviality of this seemingly simple task. Finally, the results are extended to the problem of off-policy evaluation in contextual bandits and fixed-horizon Markov decision processes.
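For concreteness, the sketch below illustrates the kind of estimators compared in the multi-armed bandit setting: importance sampling, weighted importance sampling, and the regression ("direct") estimator. The function name, signature, and exact estimator forms are illustrative assumptions for this page, not the paper's precise definitions.

```python
import numpy as np

def off_policy_estimates(actions, rewards, pi_b, pi_t, n_actions):
    """Illustrative off-policy value estimates for a multi-armed bandit.

    actions : int array of actions drawn from the behavior policy
    rewards : float array of observed rewards
    pi_b    : behavior-policy action probabilities, shape (n_actions,)
    pi_t    : target-policy action probabilities, shape (n_actions,)
    """
    # Importance sampling (IS): reweight each reward by the likelihood ratio.
    w = pi_t[actions] / pi_b[actions]
    is_est = np.mean(w * rewards)

    # Weighted importance sampling (WIS): self-normalize the weights.
    wis_est = np.sum(w * rewards) / np.sum(w)

    # Regression ("direct") estimator: estimate each action's mean reward
    # from the data, then average those estimates under the target policy.
    r_hat = np.array([
        rewards[actions == a].mean() if np.any(actions == a) else 0.0
        for a in range(n_actions)
    ])
    reg_est = float(pi_t @ r_hat)

    return is_est, wis_est, reg_est

# Hypothetical usage: behavior policy pi_b collects the data, and we
# estimate the value of target policy pi_t (true value = pi_t @ mu).
rng = np.random.default_rng(0)
n_actions, n = 3, 10_000
mu = np.array([0.2, 0.5, 0.8])        # true mean rewards (unknown to the estimators)
pi_b = np.array([0.6, 0.3, 0.1])
pi_t = np.array([0.1, 0.2, 0.7])
acts = rng.choice(n_actions, size=n, p=pi_b)
rews = rng.binomial(1, mu[acts]).astype(float)
print(off_policy_estimates(acts, rews, pi_b, pi_t, n_actions))
```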