On the Design of Estimators for Bandit Off-Policy Evaluation

Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, Tony Jebara
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6468-6476, 2019.

Abstract

Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets.
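Background illustration (not taken from the paper): a minimal Python sketch of the general idea of combining an importance-sampling (IPS) base estimator with a simple control variate in a multi-armed bandit. Here the "parametrized class" is just a constant baseline b; the function name and the toy policies are illustrative assumptions, not the paper's estimator design.

import numpy as np

def ips_with_control_variate(actions, rewards, pi_target, pi_logging, b=0.0):
    # Off-policy estimate of the target policy's value from data logged under pi_logging.
    # actions    : logged actions (int array)
    # rewards    : observed rewards for those actions
    # pi_target  : pi_target[a]  = probability the target policy plays arm a
    # pi_logging : pi_logging[a] = probability the logging policy plays arm a
    # b          : control-variate parameter; b = 0 recovers plain IPS.
    #
    # The importance weight w = pi_target[a] / pi_logging[a] has expectation 1
    # under the logging policy, so subtracting b * (w - 1) keeps the estimator
    # unbiased while potentially reducing its variance.
    w = pi_target[actions] / pi_logging[actions]
    return np.mean(w * rewards - b * (w - 1.0))

# Toy usage: 3-armed bandit, uniform logging policy, target policy prefers arm 2.
rng = np.random.default_rng(0)
pi_log = np.array([1/3, 1/3, 1/3])
pi_tgt = np.array([0.1, 0.1, 0.8])
true_means = np.array([0.2, 0.5, 0.7])

n = 10_000
acts = rng.choice(3, size=n, p=pi_log)
rews = rng.binomial(1, true_means[acts])

print(ips_with_control_variate(acts, rews, pi_tgt, pi_log, b=0.0))  # plain IPS
print(ips_with_control_variate(acts, rews, pi_tgt, pi_log, b=0.5))  # with baseline

How to choose the control-variate parameter within a given class so as to provably reduce the risk of the base estimator is exactly the question the paper addresses; the constant baseline above is only the simplest such class.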

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-vlassis19a,
  title = {On the Design of Estimators for Bandit Off-Policy Evaluation},
  author = {Vlassis, Nikos and Bibaut, Aurelien and Dimakopoulou, Maria and Jebara, Tony},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages = {6468--6476},
  year = {2019},
  editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = {97},
  series = {Proceedings of Machine Learning Research},
  month = {09--15 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v97/vlassis19a/vlassis19a.pdf},
  url = {https://proceedings.mlr.press/v97/vlassis19a.html},
  abstract = {Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets.}
}
Endnote
%0 Conference Paper
%T On the Design of Estimators for Bandit Off-Policy Evaluation
%A Nikos Vlassis
%A Aurelien Bibaut
%A Maria Dimakopoulou
%A Tony Jebara
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-vlassis19a
%I PMLR
%P 6468--6476
%U https://proceedings.mlr.press/v97/vlassis19a.html
%V 97
%X Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets.
APA
Vlassis, N., Bibaut, A., Dimakopoulou, M. & Jebara, T. (2019). On the Design of Estimators for Bandit Off-Policy Evaluation. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:6468-6476. Available from https://proceedings.mlr.press/v97/vlassis19a.html.
