Best-of-K-bandits

Max Simchowitz; Kevin Jamieson; Benjamin Recht

29th Annual Conference on Learning Theory, PMLR 49:1440-1489, 2016.

Abstract

This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain favorable distributions because the influence of high-order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds.
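The game described in the abstract, together with the naive baseline of treating each subset as a separate arm, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the algorithm from the paper: it samples every subset a fixed number of times instead of using a confidence-based stopping rule, it assumes independent Bernoulli arms, and the names `play_round`, `independent_bernoulli`, and `naive_best_subset` are hypothetical.

```python
import itertools
import random


def play_round(subset, sample):
    """One query: the player only observes the max of X(i) over i in the subset."""
    return max(sample[i] for i in subset)


def independent_bernoulli(means):
    """Assumed environment: X(i) ~ Bernoulli(means[i]), independent across i."""
    def draw(rng):
        return [1 if rng.random() < p else 0 for p in means]
    return draw


def naive_best_subset(draw, n, k, pulls_per_subset, rng):
    """Treat each of the N-choose-K subsets as its own arm.

    Every subset is queried pulls_per_subset times (uniform sampling, an
    assumption made for brevity) and the subset with the highest empirical
    mean reward is returned along with that mean.
    """
    best_subset, best_mean = None, float("-inf")
    for subset in itertools.combinations(range(n), k):
        total = 0.0
        for _ in range(pulls_per_subset):
            total += play_round(subset, draw(rng))
        mean = total / pulls_per_subset
        if mean > best_mean:
            best_subset, best_mean = subset, mean
    return best_subset, best_mean


if __name__ == "__main__":
    rng = random.Random(0)
    draw = independent_bernoulli([0.9, 0.8, 0.1, 0.1])
    print(naive_best_subset(draw, n=4, k=2, pulls_per_subset=2000, rng=rng))
```

The loop visits all N-choose-K subsets, which is the exhaustive search that the abstract's lower bounds show cannot be avoided in general. For independent Bernoulli arms the expected reward of a subset is 1 minus the product of (1 - mean) over its members, so in this toy instance the pair of the two highest-mean arms is the best subset.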

Cite this Paper

BibTeX

@InProceedings{pmlr-v49-simchowitz16,
  title = 	 {Best-of-K-bandits},
  author = 	 {Simchowitz, Max and Jamieson, Kevin and Recht, Benjamin},
  booktitle = 	 {29th Annual Conference on Learning Theory},
  pages = 	 {1440--1489},
  year = 	 {2016},
  editor = 	 {Feldman, Vitaly and Rakhlin, Alexander and Shamir, Ohad},
  volume = 	 {49},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Columbia University, New York, New York, USA},
  month = 	 {23--26 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v49/simchowitz16.pdf},
  url = 	 {https://proceedings.mlr.press/v49/simchowitz16.html},
  abstract = 	 {This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain favorable distributions because the influence of high-order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds.}
}

Endnote

%0 Conference Paper
%T Best-of-K-bandits
%A Max Simchowitz
%A Kevin Jamieson
%A Benjamin Recht
%B 29th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2016
%E Vitaly Feldman
%E Alexander Rakhlin
%E Ohad Shamir
%F pmlr-v49-simchowitz16
%I PMLR
%P 1440--1489
%U https://proceedings.mlr.press/v49/simchowitz16.html
%V 49
%X This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain favorable distributions because the influence of high-order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds.

RIS

TY  - CPAPER
TI  - Best-of-K-bandits
AU  - Max Simchowitz
AU  - Kevin Jamieson
AU  - Benjamin Recht
BT  - 29th Annual Conference on Learning Theory
DA  - 2016/06/06
ED  - Vitaly Feldman
ED  - Alexander Rakhlin
ED  - Ohad Shamir
ID  - pmlr-v49-simchowitz16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 49
SP  - 1440
EP  - 1489
L1  - http://proceedings.mlr.press/v49/simchowitz16.pdf
UR  - https://proceedings.mlr.press/v49/simchowitz16.html
AB  - This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain favorable distributions because the influence of high-order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds.
ER  -

APA

Simchowitz, M., Jamieson, K. & Recht, B. (2016). Best-of-K-bandits. 29th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 49:1440-1489. Available from https://proceedings.mlr.press/v49/simchowitz16.html.
