Profitable Bandits

Mastane Achab; Stephan Clémençon; Aurélien Garivier

Profitable Bandits

Mastane Achab, Stephan Clémençon, Aurélien Garivier

Proceedings of The 10th Asian Conference on Machine Learning, PMLR 95:694-709, 2018.

Abstract

Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the \emph{profitable bandit problem} here. At each step, an agent chooses a subset of the

$K\geq 1$ possible actions. For each action chosen, she then respectively pays and receives the sum of a random number of costs and rewards. Her objective is to maximize her cumulated profit. We adapt and study three well-known strategies in this purpose, that were proved to be most efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from a theoretical and empirical perspective both at the same time. We give simple, self-contained proofs that emphasize their similarities, as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.

Cite this Paper

BibTeX


@InProceedings{pmlr-v95-achab18a,
  title = 	 {Profitable Bandits},
  author =       {Achab, Mastane and Cl\'emen\c{c}on, Stephan and Garivier, Aur\'elien},
  booktitle = 	 {Proceedings of The 10th Asian Conference on Machine Learning},
  pages = 	 {694--709},
  year = 	 {2018},
  editor = 	 {Zhu, Jun and Takeuchi, Ichiro},
  volume = 	 {95},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--16 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v95/achab18a/achab18a.pdf},
  url = 	 {https://proceedings.mlr.press/v95/achab18a.html},
  abstract = 	 {Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the \emph{profitable bandit problem} here. At each step, an agent chooses a subset of the $K\geq 1$ possible actions. For each action chosen, she then respectively pays and receives the sum of a random number of costs and rewards. Her objective is to maximize her cumulated profit. We adapt and study three well-known strategies in this purpose, that were proved to be most efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from a theoretical and empirical perspective both at the same time. We give simple, self-contained proofs that emphasize their similarities, as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.}
}

Endnote

%0 Conference Paper
%T Profitable Bandits
%A Mastane Achab
%A Stephan Clémençon
%A Aurélien Garivier
%B Proceedings of The 10th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jun Zhu
%E Ichiro Takeuchi	
%F pmlr-v95-achab18a
%I PMLR
%P 694--709
%U https://proceedings.mlr.press/v95/achab18a.html
%V 95
%X Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the \emph{profitable bandit problem} here. At each step, an agent chooses a subset of the $K\geq 1$ possible actions. For each action chosen, she then respectively pays and receives the sum of a random number of costs and rewards. Her objective is to maximize her cumulated profit. We adapt and study three well-known strategies in this purpose, that were proved to be most efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from a theoretical and empirical perspective both at the same time. We give simple, self-contained proofs that emphasize their similarities, as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.

APA


Achab, M., Clémençon, S. & Garivier, A.. (2018). Profitable Bandits. Proceedings of The 10th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 95:694-709 Available from https://proceedings.mlr.press/v95/achab18a.html.

Profitable Bandits

Abstract

Cite this Paper

Related Material