Maximin Action Identification: A New Bandit Framework for Games

Aurélien Garivier, Emilie Kaufmann, Wouter M. Koolen
29th Annual Conference on Learning Theory, PMLR 49:1028-1050, 2016.

Abstract

We study a novel pure-exploration problem in a strategic bandit model motivated by Monte Carlo Tree Search: identifying the best action in a game when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower and upper confidence bounds, and Maximin-Racing, which operates by successively eliminating sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically. We also sketch a lower-bound analysis and possible connections to an optimal algorithm.
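
A minimal Python sketch of the setting, assuming Bernoulli outcomes: the player repeatedly picks a pair (i, j), observes a random outcome with mean mu[i][j], and must identify argmax_i min_j mu[i][j] with confidence 1 - delta. The confidence-bound shape, the exploration rate beta, and the sampling rule below are generic LUCB-style choices made for illustration, not the paper's exact Maximin-LUCB.

import math
import random

def beta(t, delta, n_pairs):
    # Heuristic union-bound-style exploration rate; an assumption, not the paper's choice.
    return math.log(n_pairs * max(t, 2) ** 2 / delta)

def maximin_lucb(mu, delta=0.05, max_samples=200_000):
    # Identify argmax_i min_j mu[i][j] from Bernoulli samples; assumes at least two rows.
    K, L = len(mu), len(mu[0])
    counts = [[0] * L for _ in range(K)]
    sums = [[0.0] * L for _ in range(K)]

    def sample(i, j):
        sums[i][j] += float(random.random() < mu[i][j])
        counts[i][j] += 1

    for i in range(K):          # initialization: one sample per pair
        for j in range(L):
            sample(i, j)

    for t in range(K * L, max_samples):
        rad = lambda i, j: math.sqrt(beta(t, delta, K * L) / (2 * counts[i][j]))
        lcb = lambda i, j: sums[i][j] / counts[i][j] - rad(i, j)
        ucb = lambda i, j: sums[i][j] / counts[i][j] + rad(i, j)

        # Row i's value is min_j mu[i][j]; the min of per-pair bounds brackets it.
        row_lcb = [min(lcb(i, j) for j in range(L)) for i in range(K)]
        row_ucb = [min(ucb(i, j) for j in range(L)) for i in range(K)]

        best = max(range(K), key=lambda i: row_lcb[i])
        challenger = max((i for i in range(K) if i != best),
                         key=lambda i: row_ucb[i])

        # Stop once the best row's lower bound clears every other row's upper bound.
        if row_lcb[best] >= row_ucb[challenger]:
            return best

        # Refine the column that currently determines each candidate row's bound.
        sample(best, min(range(L), key=lambda j: lcb(best, j)))
        sample(challenger, min(range(L), key=lambda j: ucb(challenger, j)))

    # Budget exhausted: fall back to the empirical maximin row.
    return max(range(K),
               key=lambda i: min(sums[i][j] / counts[i][j] for j in range(L)))

For instance, on the game mu = [[0.6, 0.7], [0.5, 0.9], [0.8, 0.3]], whose maximin row is 0 (with value 0.6), maximin_lucb(mu) should return 0 on most runs. The stopping rule mirrors standard LUCB reasoning, here applied to the minimum over columns rather than to a single arm's mean.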

Cite this Paper


BibTeX
@InProceedings{pmlr-v49-garivier16b,
  title     = {Maximin Action Identification: A New Bandit Framework for Games},
  author    = {Garivier, Aurélien and Kaufmann, Emilie and Koolen, Wouter M.},
  booktitle = {29th Annual Conference on Learning Theory},
  pages     = {1028--1050},
  year      = {2016},
  editor    = {Feldman, Vitaly and Rakhlin, Alexander and Shamir, Ohad},
  volume    = {49},
  series    = {Proceedings of Machine Learning Research},
  address   = {Columbia University, New York, New York, USA},
  month     = {23--26 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v49/garivier16b.pdf},
  url       = {https://proceedings.mlr.press/v49/garivier16b.html}
}
