Optimal Thompson Sampling strategies for support-aware CVaR bandits

Dorian Baudry, Romain Gautron, Emilie Kaufmann, Odalric Maillard
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:716-726, 2021.

Abstract

In this paper we study a multi-armed bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits with bounded rewards that is flexible enough to solve a variety of problems grounded in physical resources. Building on recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that enables us to bound the CVaR regret of these strategies. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we empirically illustrate the benefits of Thompson Sampling approaches, both in a realistic environment simulating a use case in agriculture and on various synthetic examples.
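As a point of reference for the risk measure the paper optimizes: a minimal sketch of CVaR at level alpha for bounded rewards, both as an empirical estimate from samples and exactly for a finite-support (multinomial) arm. Function names and interfaces here are illustrative only, not taken from the paper or its code.

```python
import numpy as np

def empirical_cvar(samples, alpha):
    """Empirical CVaR at level alpha: the mean of the worst
    (lowest) alpha-fraction of observed rewards."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * len(x))))  # number of lower-tail samples
    return float(x[:k].mean())

def discrete_cvar(support, probs, alpha):
    """Exact CVaR at level alpha for a distribution with finite support,
    as arises for multinomial arms: average over the lower tail of
    probability mass alpha."""
    order = np.argsort(support)
    total, mass = 0.0, 0.0
    for i in order:
        w = min(probs[i], alpha - mass)  # mass drawn from this atom
        total += w * support[i]
        mass += w
        if mass >= alpha:
            break
    return total / alpha
```

Note that with alpha = 1 both functions recover the ordinary mean, so CVaR bandits interpolate between worst-case and mean-reward objectives as alpha varies.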

Cite this Paper

BibTeX
@InProceedings{pmlr-v139-baudry21a,
  title     = {Optimal Thompson Sampling strategies for support-aware CVaR bandits},
  author    = {Baudry, Dorian and Gautron, Romain and Kaufmann, Emilie and Maillard, Odalric},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {716--726},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/baudry21a/baudry21a.pdf},
  url       = {https://proceedings.mlr.press/v139/baudry21a.html},
  abstract  = {In this paper we study a multi-arm bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded on physical resources. Building on a recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that enables to theoretically bound their CVaR regret minimization performance. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we illustrate empirically the benefit of Thompson Sampling approaches both in a realistic environment simulating a use-case in agriculture and on various synthetic examples.}
}
Endnote
%0 Conference Paper
%T Optimal Thompson Sampling strategies for support-aware CVaR bandits
%A Dorian Baudry
%A Romain Gautron
%A Emilie Kaufmann
%A Odalric Maillard
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-baudry21a
%I PMLR
%P 716--726
%U https://proceedings.mlr.press/v139/baudry21a.html
%V 139
%X In this paper we study a multi-arm bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded on physical resources. Building on a recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that enables to theoretically bound their CVaR regret minimization performance. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we illustrate empirically the benefit of Thompson Sampling approaches both in a realistic environment simulating a use-case in agriculture and on various synthetic examples.
APA
Baudry, D., Gautron, R., Kaufmann, E. & Maillard, O. (2021). Optimal Thompson Sampling strategies for support-aware CVaR bandits. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:716-726. Available from https://proceedings.mlr.press/v139/baudry21a.html.