The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

Aurélien Garivier, Olivier Cappé
Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:359-376, 2011.

Abstract

This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded) rewards, including those generated from exponential families of distributions. A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS, UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better than the basic UCB policy. Our regret bounds rely on deviation results of independent interest, which are stated and proved in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB algorithm.
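For readers who want a concrete picture of the index policy described above, here is a minimal sketch in Python of KL-UCB for Bernoulli rewards (the names bernoulli_kl, kl_ucb_index, and kl_ucb are illustrative, not taken from the paper). The index of an arm with empirical mean p_hat after n pulls is the largest q in [p_hat, 1] such that n * d(p_hat, q) <= log(t) + c*log(log(t)), where d is the Bernoulli Kullback-Leibler divergence; since d(p_hat, .) is increasing on [p_hat, 1], bisection is one natural way to compute it. The clipping and the guard on log(log(t)) for small t are implementation details assumed here, not part of the paper's statement.

import math
import random

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q),
    # with clipping so the logarithms stay finite at p, q in {0, 1}.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n, t, c=3.0, tol=1e-6):
    # Largest q in [p_hat, 1] with n * d(p_hat, q) <= log(t) + c*log(log(t)).
    # c = 3 matches the paper's analysis; smaller c is often used in practice.
    # The max(..., 1.0) guards log(log(t)) for small t (an assumption of this
    # sketch). d(p_hat, .) is increasing on [p_hat, 1], so bisection applies.
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if n * bernoulli_kl(p_hat, mid) <= budget:
            lo = mid   # mid is still feasible: move the lower end up
        else:
            hi = mid   # mid violates the constraint: shrink from above
    return lo

def kl_ucb(means, horizon, c=3.0, seed=0):
    # Simulate KL-UCB on a Bernoulli bandit with the given (unknown) means.
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for a in range(k):  # initialization: pull each arm once
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])
    for t in range(k + 1, horizon + 1):
        # Play the arm with the highest KL-UCB index.
        a = max(range(k),
                key=lambda i: kl_ucb_index(sums[i] / counts[i], counts[i], t, c))
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])
    return counts  # pull counts; suboptimal arms should grow only logarithmically

A call such as kl_ucb([0.1, 0.5, 0.6], 10000) returns the pull counts after 10,000 rounds. For general bounded rewards in [0, 1], applying this same Bernoulli-KL index to the empirical means is what yields the uniform improvement over UCB claimed in the abstract.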

Cite this Paper


BibTeX

@InProceedings{pmlr-v19-garivier11a,
  title     = {The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond},
  author    = {Garivier, Aurélien and Cappé, Olivier},
  booktitle = {Proceedings of the 24th Annual Conference on Learning Theory},
  pages     = {359--376},
  year      = {2011},
  editor    = {Kakade, Sham M. and von Luxburg, Ulrike},
  volume    = {19},
  series    = {Proceedings of Machine Learning Research},
  address   = {Budapest, Hungary},
  month     = {09--11 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v19/garivier11a/garivier11a.pdf},
  url       = {https://proceedings.mlr.press/v19/garivier11a.html}
}
APA
Garivier, A. & Cappé, O. (2011). The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. Proceedings of the 24th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 19:359-376. Available from https://proceedings.mlr.press/v19/garivier11a.html.