Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudík
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3589-3597, 2017.

Abstract

We study the off-policy evaluation problem—estimating the value of a target policy using data collected by another policy—under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
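The abstract only names the estimators it compares. As a rough orientation, the NumPy sketch below shows one plausible reading of how IPS, DR, and the SWITCH idea turn logged bandit data into a value estimate. It is not the authors' code: the array layout (pi_all, mu_all, rhat_all as n-by-K matrices of target propensities, logging propensities, and reward-model predictions), the threshold tau, and the exact SWITCH variant shown are assumptions based on standard presentations of these estimators, not taken from this page.

    import numpy as np

    # Assumed shapes: n logged rounds, K actions.
    #   rewards: (n,)   observed rewards
    #   actions: (n,)   logged action indices
    #   pi_all, mu_all, rhat_all: (n, K) target policy, logging policy, reward model

    def ips_estimate(rewards, rho_logged):
        """Inverse propensity scoring: importance-weighted average of logged rewards."""
        return np.mean(rho_logged * rewards)

    def dr_estimate(rewards, actions, rho_logged, pi_all, rhat_all):
        """Doubly robust: model-based value plus an importance-weighted correction."""
        n = len(rewards)
        direct = (pi_all * rhat_all).sum(axis=1)         # E_{a~pi}[rhat(x,a)] per context
        rhat_logged = rhat_all[np.arange(n), actions]    # model prediction at the logged action
        return np.mean(direct + rho_logged * (rewards - rhat_logged))

    def switch_estimate(rewards, actions, pi_all, mu_all, rhat_all, tau):
        """SWITCH (sketch): IPS on actions whose importance weight is at most tau,
        reward model on the remaining high-weight actions."""
        n = len(rewards)
        rho_all = pi_all / mu_all                        # importance weight for every (context, action)
        small = rho_all <= tau
        model_part = (pi_all * rhat_all * ~small).sum(axis=1)   # model term over high-weight actions
        rho_logged = rho_all[np.arange(n), actions]
        ips_part = rho_logged * rewards * small[np.arange(n), actions]
        return np.mean(ips_part + model_part)

The threshold tau governs a bias-variance tradeoff of the kind the abstract describes: a small tau leans on the (possibly biased) reward model, while a large tau recovers plain IPS with its large importance weights.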

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-wang17a,
  title     = {Optimal and Adaptive Off-policy Evaluation in Contextual Bandits},
  author    = {Yu-Xiang Wang and Alekh Agarwal and Miroslav Dud\'{\i}k},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages     = {3589--3597},
  year      = {2017},
  editor    = {Precup, Doina and Teh, Yee Whye},
  volume    = {70},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--11 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v70/wang17a/wang17a.pdf},
  url       = {https://proceedings.mlr.press/v70/wang17a.html},
  abstract  = {We study the off-policy evaluation problem—estimating the value of a target policy using data collected by another policy—under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.}
}
Endnote
%0 Conference Paper
%T Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
%A Yu-Xiang Wang
%A Alekh Agarwal
%A Miroslav Dudík
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-wang17a
%I PMLR
%P 3589--3597
%U https://proceedings.mlr.press/v70/wang17a.html
%V 70
%X We study the off-policy evaluation problem—estimating the value of a target policy using data collected by another policy—under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
APA
Wang, Y., Agarwal, A. & Dudík, M. (2017). Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:3589-3597. Available from https://proceedings.mlr.press/v70/wang17a.html.
