The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

Aurélien Garivier; Olivier Cappé

The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

Aurélien Garivier, Olivier Cappé

Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:359-376, 2011.

Abstract

This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded) rewards, including those generated from exponential families of distributions. A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS, UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better than the basic UCB policy. Our regret bounds rely on deviations results of independent interest which are stated and proved in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB algorithm.

Cite this Paper

BibTeX


@InProceedings{pmlr-v19-garivier11a,
  title = 	 {The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond},
  author = 	 {Garivier, Aurélien and Cappé, Olivier},
  booktitle = 	 {Proceedings of the 24th Annual Conference on Learning Theory},
  pages = 	 {359--376},
  year = 	 {2011},
  editor = 	 {Kakade, Sham M. and von Luxburg, Ulrike},
  volume = 	 {19},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Budapest, Hungary},
  month = 	 {09--11 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v19/garivier11a/garivier11a.pdf},
  url = 	 {https://proceedings.mlr.press/v19/garivier11a.html},
  abstract = 	 {This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free  index policy for stochastic bandit problems.  We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm  satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of  Bernoulli rewards, it reaches the lower bound of Lai and Robbins.  Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for  specific classes of (possibly unbounded) rewards, including those generated from exponential  families of distributions.  A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS,  UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better  than the basic UCB policy.  Our regret bounds rely on deviations results of independent interest which are stated and proved  in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB  algorithm.}
}

Endnote

%0 Conference Paper
%T The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
%A Aurélien Garivier
%A Olivier Cappé
%B Proceedings of the 24th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2011
%E Sham M. Kakade
%E Ulrike von Luxburg	
%F pmlr-v19-garivier11a
%I PMLR
%P 359--376
%U https://proceedings.mlr.press/v19/garivier11a.html
%V 19
%X This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free  index policy for stochastic bandit problems.  We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm  satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of  Bernoulli rewards, it reaches the lower bound of Lai and Robbins.  Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for  specific classes of (possibly unbounded) rewards, including those generated from exponential  families of distributions.  A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS,  UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better  than the basic UCB policy.  Our regret bounds rely on deviations results of independent interest which are stated and proved  in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB  algorithm.

RIS


TY  - CPAPER
TI  - The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
AU  - Aurélien Garivier
AU  - Olivier Cappé
BT  - Proceedings of the 24th Annual Conference on Learning Theory
DA  - 2011/12/21
ED  - Sham M. Kakade
ED  - Ulrike von Luxburg	
ID  - pmlr-v19-garivier11a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 19
SP  - 359
EP  - 376
L1  - http://proceedings.mlr.press/v19/garivier11a/garivier11a.pdf
UR  - https://proceedings.mlr.press/v19/garivier11a.html
AB  - This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free  index policy for stochastic bandit problems.  We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm  satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of  Bernoulli rewards, it reaches the lower bound of Lai and Robbins.  Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for  specific classes of (possibly unbounded) rewards, including those generated from exponential  families of distributions.  A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS,  UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better  than the basic UCB policy.  Our regret bounds rely on deviations results of independent interest which are stated and proved  in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB  algorithm.
ER  -

APA


Garivier, A. & Cappé, O.. (2011). The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. Proceedings of the 24th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 19:359-376 Available from https://proceedings.mlr.press/v19/garivier11a.html.

Related Material

Download PDF