The Best of Both Worlds: Stochastic and Adversarial Bandits

Sébastien Bubeck; Aleksandrs Slivkins

The Best of Both Worlds: Stochastic and Adversarial Bandits

Sébastien Bubeck, Aleksandrs Slivkins

Proceedings of the 25th Annual Conference on Learning Theory, PMLR 23:42.1-42.23, 2012.

Abstract

We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal) whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the \emphO(√\emphn) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately, and does not attempt to jointly optimize for both. This result falls into the general agenda to design algorithms that combine the optimal worst-case performance with improved guarantees for “nice” problem instances.

Cite this Paper

BibTeX


@InProceedings{pmlr-v23-bubeck12b,
  title = 	 {The Best of Both Worlds: Stochastic and Adversarial Bandits},
  author = 	 {Bubeck, Sébastien and Slivkins, Aleksandrs},
  booktitle = 	 {Proceedings of the 25th Annual Conference on Learning Theory},
  pages = 	 {42.1--42.23},
  year = 	 {2012},
  editor = 	 {Mannor, Shie and Srebro, Nathan and Williamson, Robert C.},
  volume = 	 {23},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Edinburgh, Scotland},
  month = 	 {25--27 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v23/bubeck12b/bubeck12b.pdf},
  url = 	 {https://proceedings.mlr.press/v23/bubeck12b.html},
  abstract = 	 {We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal) whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the \emphO(√\emphn) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately, and does not attempt to jointly optimize for both. This result falls into the general agenda to design algorithms that combine the optimal worst-case performance with improved guarantees for “nice” problem instances.}
}

Endnote

%0 Conference Paper
%T The Best of Both Worlds: Stochastic and Adversarial Bandits
%A Sébastien Bubeck
%A Aleksandrs Slivkins
%B Proceedings of the 25th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2012
%E Shie Mannor
%E Nathan Srebro
%E Robert C. Williamson	
%F pmlr-v23-bubeck12b
%I PMLR
%P 42.1--42.23
%U https://proceedings.mlr.press/v23/bubeck12b.html
%V 23
%X We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal) whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the \emphO(√\emphn) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately, and does not attempt to jointly optimize for both. This result falls into the general agenda to design algorithms that combine the optimal worst-case performance with improved guarantees for “nice” problem instances.

RIS


TY  - CPAPER
TI  - The Best of Both Worlds: Stochastic and Adversarial Bandits
AU  - Sébastien Bubeck
AU  - Aleksandrs Slivkins
BT  - Proceedings of the 25th Annual Conference on Learning Theory
DA  - 2012/06/16
ED  - Shie Mannor
ED  - Nathan Srebro
ED  - Robert C. Williamson	
ID  - pmlr-v23-bubeck12b
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 23
SP  - 42.1
EP  - 42.23
L1  - http://proceedings.mlr.press/v23/bubeck12b/bubeck12b.pdf
UR  - https://proceedings.mlr.press/v23/bubeck12b.html
AB  - We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal) whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the \emphO(√\emphn) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately, and does not attempt to jointly optimize for both. This result falls into the general agenda to design algorithms that combine the optimal worst-case performance with improved guarantees for “nice” problem instances.
ER  -

APA


Bubeck, S. & Slivkins, A.. (2012). The Best of Both Worlds: Stochastic and Adversarial Bandits. Proceedings of the 25th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 23:42.1-42.23 Available from https://proceedings.mlr.press/v23/bubeck12b.html.

Related Material

Download PDF