On the Design of Estimators for Bandit Off-Policy Evaluation

Nikos Vlassis; Aurelien Bibaut; Maria Dimakopoulou; Tony Jebara

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6468-6476, 2019.

Abstract

Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets.
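The idea of pairing a base estimator with a control variate can be illustrated on a small multi-armed bandit. The sketch below is a generic textbook construction, not the estimator derived in the paper: it assumes inverse propensity scoring (IPS) as the base estimator and uses the zero-mean quantity `w - 1` (importance weight minus one) as a one-parameter control variate scaled by `beta`. All names (`ips_with_control_variate`, `plug_in_beta`, the policies, and the reward means) are hypothetical and chosen for illustration only.

```python
import random


def ips_with_control_variate(actions, rewards, target, logging, beta=0.0):
    """IPS estimate of the target policy's value, minus beta times a control variate.

    The control variate is the sample mean of (w - 1), where w = target / logging.
    Its expectation under the logging policy is zero, so for a fixed beta the
    estimator stays unbiased. beta = 0 recovers plain IPS.
    """
    n = len(actions)
    weights = [target[a] / logging[a] for a in actions]
    base = sum(w * r for w, r in zip(weights, rewards)) / n
    control = sum(w - 1.0 for w in weights) / n
    return base - beta * control


def plug_in_beta(actions, rewards, target, logging):
    """Sample estimate of Cov(w * r, w) / Var(w), the variance-minimizing beta.

    Assumption: estimating beta from the same data adds a small bias, which
    this sketch ignores.
    """
    n = len(actions)
    weights = [target[a] / logging[a] for a in actions]
    weighted = [w * r for w, r in zip(weights, rewards)]
    mean_w = sum(weights) / n
    mean_wr = sum(weighted) / n
    cov = sum((x - mean_wr) * (w - mean_w) for x, w in zip(weighted, weights)) / n
    var = sum((w - mean_w) ** 2 for w in weights) / n
    return cov / var if var > 0 else 0.0


# Hypothetical three-armed bandit with Bernoulli rewards.
rng = random.Random(0)
logging_policy = [0.5, 0.3, 0.2]
target_policy = [0.1, 0.2, 0.7]
mean_reward = [0.2, 0.5, 0.8]

actions = rng.choices(range(3), weights=logging_policy, k=5000)
rewards = [1.0 if rng.random() < mean_reward[a] else 0.0 for a in actions]

true_value = sum(p * m for p, m in zip(target_policy, mean_reward))
beta_hat = plug_in_beta(actions, rewards, target_policy, logging_policy)
plain_ips = ips_with_control_variate(actions, rewards, target_policy, logging_policy)
adjusted = ips_with_control_variate(
    actions, rewards, target_policy, logging_policy, beta=beta_hat
)
print(true_value, plain_ips, adjusted)
```

Here `beta` plays the role of the class parameter: each value of `beta` gives a different estimator, and the design question is which value lowers the risk relative to the base estimator.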

Cite this Paper

BibTeX


@InProceedings{pmlr-v97-vlassis19a,
  title     = {On the Design of Estimators for Bandit Off-Policy Evaluation},
  author    = {Vlassis, Nikos and Bibaut, Aurelien and Dimakopoulou, Maria and Jebara, Tony},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {6468--6476},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/vlassis19a/vlassis19a.pdf},
  url       = {https://proceedings.mlr.press/v97/vlassis19a.html},
  abstract  = {Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets.}
}

Endnote

%0 Conference Paper
%T On the Design of Estimators for Bandit Off-Policy Evaluation
%A Nikos Vlassis
%A Aurelien Bibaut
%A Maria Dimakopoulou
%A Tony Jebara
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-vlassis19a
%I PMLR
%P 6468--6476
%U https://proceedings.mlr.press/v97/vlassis19a.html
%V 97
%X Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets.

APA


Vlassis, N., Bibaut, A., Dimakopoulou, M., & Jebara, T. (2019). On the Design of Estimators for Bandit Off-Policy Evaluation. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:6468-6476. Available from https://proceedings.mlr.press/v97/vlassis19a.html.
