Profitable Bandits

Mastane Achab, Stephan Clémençon, Aurélien Garivier
Proceedings of The 10th Asian Conference on Machine Learning, PMLR 95:694-709, 2018.

Abstract

Originally motivated by default risk management applications, this paper investigates a novel problem, referred to here as the \emph{profitable bandit problem}. At each step, an agent chooses a subset of the $K\geq 1$ possible actions. For each chosen action, she then pays the sum of a random number of costs and receives the sum of a random number of rewards. Her objective is to maximize her cumulative profit. For this purpose, we adapt and study three well-known strategies that have been proved efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite-time regret bound which, together with a lower bound we also derive, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from both a theoretical and an empirical perspective. We give simple, self-contained proofs that emphasize their similarities as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.
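To make the decision structure of the problem concrete, below is a minimal, hypothetical Python/NumPy sketch in the spirit of the Thompson Sampling strategy mentioned above. It assumes Bernoulli rewards with unknown means, one reward draw per play, a fixed and known cost per play, and Beta priors; none of these modeling choices is specified in the abstract, and the paper's actual algorithms and reward models may differ.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem instance: K arms with Bernoulli rewards (unknown means
# true_mu) and a fixed, known cost paid each time an arm is played.
K = 4
true_mu = np.array([0.2, 0.4, 0.6, 0.8])
cost = np.full(K, 0.5)
T = 10_000

# Beta(1, 1) priors on each arm's mean reward.
alpha = np.ones(K)
beta_ = np.ones(K)
profit = 0.0

for t in range(T):
    theta = rng.beta(alpha, beta_)              # one posterior sample per arm
    chosen = np.where(theta > cost)[0]          # play every arm that looks profitable
    rewards = (rng.random(len(chosen)) < true_mu[chosen]).astype(float)
    profit += rewards.sum() - cost[chosen].sum()
    alpha[chosen] += rewards                    # posterior update: successes
    beta_[chosen] += 1.0 - rewards              # posterior update: failures

# Oracle benchmark: always play exactly the arms with true_mu > cost.
oracle = T * np.clip(true_mu - cost, 0.0, None).sum()
print(f"profit = {profit:.1f}  oracle = {oracle:.1f}  regret = {oracle - profit:.1f}")

The sketch illustrates why per-arm index and posterior-sampling rules are natural in this setting: since the agent selects a subset rather than a single arm, each arm can be included or excluded independently, by comparing its sampled (or optimistic) mean to its cost rather than competing against the other arms.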

Cite this Paper


BibTeX
@InProceedings{pmlr-v95-achab18a,
  title     = {Profitable Bandits},
  author    = {Achab, Mastane and Cl\'emen\c{c}on, Stephan and Garivier, Aur\'elien},
  booktitle = {Proceedings of The 10th Asian Conference on Machine Learning},
  pages     = {694--709},
  year      = {2018},
  editor    = {Zhu, Jun and Takeuchi, Ichiro},
  volume    = {95},
  series    = {Proceedings of Machine Learning Research},
  month     = {14--16 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v95/achab18a/achab18a.pdf},
  url       = {https://proceedings.mlr.press/v95/achab18a.html},
  abstract  = {Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the \emph{profitable bandit problem} here. At each step, an agent chooses a subset of the $K\geq 1$ possible actions. For each action chosen, she then respectively pays and receives the sum of a random number of costs and rewards. Her objective is to maximize her cumulated profit. We adapt and study three well-known strategies in this purpose, that were proved to be most efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from a theoretical and empirical perspective both at the same time. We give simple, self-contained proofs that emphasize their similarities, as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.}
}
Endnote
%0 Conference Paper
%T Profitable Bandits
%A Mastane Achab
%A Stephan Clémençon
%A Aurélien Garivier
%B Proceedings of The 10th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jun Zhu
%E Ichiro Takeuchi
%F pmlr-v95-achab18a
%I PMLR
%P 694--709
%U https://proceedings.mlr.press/v95/achab18a.html
%V 95
%X Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the \emph{profitable bandit problem} here. At each step, an agent chooses a subset of the $K\geq 1$ possible actions. For each action chosen, she then respectively pays and receives the sum of a random number of costs and rewards. Her objective is to maximize her cumulated profit. We adapt and study three well-known strategies in this purpose, that were proved to be most efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from a theoretical and empirical perspective both at the same time. We give simple, self-contained proofs that emphasize their similarities, as well as their differences. While both Bayesian strategies are automatically adapted to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.
APA
Achab, M., Clémençon, S. & Garivier, A. (2018). Profitable Bandits. Proceedings of The 10th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 95:694-709. Available from https://proceedings.mlr.press/v95/achab18a.html.
