Toward Minimax Off-policy Value Estimation
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR 38:608-616, 2015.
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. It is shown that in a large class of settings the so-called regression estimator is minimax optimal up to a constant that depends on the number of actions, while the other two can be arbitrarily worse even in the limit of infinitely many data points, despite their empirical success and popularity. The performance of these estimators is studied on synthetic and real problems, illustrating the nontriviality of this simple task. Finally, the results are extended to the problem of off-policy evaluation in contextual bandits and fixed-horizon Markov decision processes.
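To make the multi-armed bandit setting concrete, the following is a minimal sketch, not the paper's code. It assumes the regression estimator is the plug-in estimate that averages observed rewards per action and weights those averages by the target policy. The abstract does not name the other two estimators, so the importance-sampling estimator shown for comparison is an assumption, as are all function names and the Bernoulli reward model in `simulate`.

```python
import random


def regression_estimate(actions, rewards, target_probs):
    """Plug-in estimate: per-action sample mean, weighted by the target policy."""
    k = len(target_probs)
    counts = [0] * k
    sums = [0.0] * k
    for a, r in zip(actions, rewards):
        counts[a] += 1
        sums[a] += r
    # Assumption of this sketch: actions never observed contribute zero.
    return sum(
        p * sums[a] / counts[a]
        for a, p in enumerate(target_probs)
        if counts[a] > 0
    )


def importance_sampling_estimate(actions, rewards, target_probs, behavior_probs):
    """Reweight each observed reward by target/behavior probability, then average."""
    total = sum(
        target_probs[a] / behavior_probs[a] * r
        for a, r in zip(actions, rewards)
    )
    return total / len(actions)


def simulate(n, behavior_probs, mean_rewards, seed=0):
    """Draw n (action, reward) pairs under the behavior policy (hypothetical Bernoulli rewards)."""
    rng = random.Random(seed)
    arms = range(len(behavior_probs))
    actions = rng.choices(arms, weights=behavior_probs, k=n)
    rewards = [1.0 if rng.random() < mean_rewards[a] else 0.0 for a in actions]
    return actions, rewards


if __name__ == "__main__":
    behavior = [0.8, 0.2]   # policy that collected the data
    target = [0.1, 0.9]     # policy whose value we want
    means = [0.2, 0.9]      # unknown in practice; used here only to simulate
    actions, rewards = simulate(20000, behavior, means)
    true_value = sum(p * m for p, m in zip(target, means))
    print("true value:", true_value)
    print("regression:", regression_estimate(actions, rewards, target))
    print("importance sampling:", importance_sampling_estimate(actions, rewards, target, behavior))
```

In this sketch the regression estimate needs only the target policy's action probabilities, whereas the importance-sampling estimate also requires the behavior policy's probabilities.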