Profitable Bandits
Proceedings of The 10th Asian Conference on Machine Learning, PMLR 95:694-709, 2018.
Abstract
Originally motivated by default risk management applications, this paper investigates a novel problem, referred to here as the \emph{profitable bandit problem}. At each step, an agent chooses a subset of the $K\geq 1$ possible actions. For each action chosen, she then respectively pays and receives the sum of a random number of costs and rewards. Her objective is to maximize her cumulative profit. For this purpose, we adapt and study three well-known strategies that have proved most efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite-time regret bound which, together with a lower bound we obtain as well, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from both a theoretical and an empirical perspective. We give simple, self-contained proofs that emphasize their similarities as well as their differences. While both Bayesian strategies automatically adapt to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.
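To make the setting concrete, here is a minimal illustrative sketch of Thompson Sampling in a simplified profitable bandit: each arm pays a Bernoulli reward and charges a fixed known cost per pull, and at each step the agent pulls every arm whose posterior-sampled mean reward exceeds its cost. This is an assumption-laden toy model for intuition only, not the paper's algorithm or analysis; all function and parameter names are hypothetical.

```python
import random


def thompson_profitable_bandit(reward_means, costs, horizon, seed=0):
    """Toy Thompson Sampling for a simplified profitable bandit.

    Assumed model (not from the paper): arm k yields a Bernoulli(reward_means[k])
    reward and charges a fixed, known cost costs[k] per pull.  At each round the
    agent plays the subset of arms whose sampled posterior mean reward exceeds
    the arm's cost, i.e. whose sampled expected profit is positive.
    """
    rng = random.Random(seed)
    K = len(reward_means)
    successes = [1] * K  # Beta(1, 1) prior on each arm's reward probability
    failures = [1] * K
    profit = 0.0
    for _ in range(horizon):
        for k in range(K):
            # Draw a plausible mean reward from arm k's Beta posterior.
            theta = rng.betavariate(successes[k], failures[k])
            if theta > costs[k]:  # play arm k only if sampled profit > 0
                r = 1 if rng.random() < reward_means[k] else 0
                profit += r - costs[k]
                successes[k] += r
                failures[k] += 1 - r
    return profit
```

With one clearly profitable arm (mean reward 0.9, cost 0.5) and one unprofitable arm (mean reward 0.1, cost 0.5), the sampler quickly concentrates its pulls on the first arm and accumulates positive profit, while the posterior of the second arm drops below its cost and the arm is abandoned.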