PAC Battling Bandits in the Plackett-Luce Model

Aadirupa Saha; Aditya Gopalan

Proceedings of the 30th International Conference on Algorithmic Learning Theory, PMLR 98:700-737, 2019.

Abstract

We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where, at each trial, the learner chooses a subset of $k$ arms from a fixed set of $n$ arms and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best-arm} from pairwise preference information and is known to require $O(\frac{n}{\epsilon^2} \ln \frac{1}{\delta})$ sample complexity. We study the sample complexity of this problem under two feedback models: (1) Winner of the subset (WI), and (2) Ranking of top-$m$ items (TR) for $2\le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \leq k \leq n$, the best achievable sample complexity is still $O\left( \frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$, and the same as that in the Dueling Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on sample complexity of $\Omega\bigg( \frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\bigg)$, which suggests a multiplicative reduction by a factor of $m$ owing to the additional information revealed from preferences among $m$ items instead of just $1$. We also propose two algorithms for the PAC problem with the TR feedback model with optimal (up to logarithmic factors) sample complexity guarantees, establishing the increase in statistical efficiency from exploiting rank-ordered feedback.
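
To make the two feedback models concrete, the following is a minimal Python sketch of how WI and TR feedback could be simulated. It assumes the standard Plackett-Luce parameterization, in which each arm $i$ has a positive score $\theta_i$ and is drawn from the remaining items of a subset $S$ with probability $\theta_i / \sum_{j \in S} \theta_j$. The names `theta`, `sample_top_m`, and `sample_winner` are hypothetical and chosen for illustration; this only simulates the feedback and is not one of the algorithms proposed in the paper.

```python
import random


def sample_top_m(theta, subset, m, rng=random):
    """Draw a top-m ranking (TR feedback) of `subset` under a PL model.

    Items are drawn one at a time without replacement; at each step an
    item is picked with probability proportional to its score among the
    items that are still unranked.
    """
    if not 1 <= m <= len(subset):
        raise ValueError("m must satisfy 1 <= m <= len(subset)")
    remaining = list(subset)
    ranking = []
    for _ in range(m):
        scores = [theta[i] for i in remaining]
        item = rng.choices(remaining, weights=scores, k=1)[0]
        ranking.append(item)
        remaining.remove(item)
    return ranking


def sample_winner(theta, subset, rng=random):
    """Draw the winner of `subset` (WI feedback): the top-1 ranking."""
    return sample_top_m(theta, subset, 1, rng)[0]


# Illustrative scores for n = 4 arms (assumed values, not from the paper).
theta = [1.0, 0.5, 0.5, 0.25]
rng = random.Random(0)
winner = sample_winner(theta, [0, 1, 2], rng)        # WI feedback, k = 3
ranking = sample_top_m(theta, [0, 1, 2, 3], 2, rng)  # TR feedback, k = 4, m = 2
```

In this sketch WI feedback is simply the $m = 1$ case of the same sampler; TR feedback as defined in the abstract covers $2 \le m \le k$.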

Cite this Paper

BibTeX


@InProceedings{pmlr-v98-saha19a,
  title = 	 {PAC Battling Bandits in the Plackett-Luce Model},
  author =       {Saha, Aadirupa and Gopalan, Aditya},
  booktitle = 	 {Proceedings of the 30th International Conference on Algorithmic Learning Theory},
  pages = 	 {700--737},
  year = 	 {2019},
  editor = 	 {Garivier, Aurélien and Kale, Satyen},
  volume = 	 {98},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {22--24 Mar},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v98/saha19a/saha19a.pdf},
  url = 	 {https://proceedings.mlr.press/v98/saha19a.html},
  abstract = 	 {We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where, at each trial, the learner chooses a subset of $k$ arms from a fixed set of $n$ arms and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best-arm} from pairwise preference information and is known to require $O(\frac{n}{\epsilon^2} \ln \frac{1}{\delta})$ sample complexity. We study the sample complexity of this problem under two feedback models: (1) Winner of the subset (WI), and (2) Ranking of top-$m$ items (TR) for $2\le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \leq k \leq n$, the best achievable sample complexity is still $O\left( \frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$, and the same as that in the Dueling Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on sample complexity of $\Omega\bigg( \frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\bigg)$, which suggests a multiplicative reduction by a factor of $m$ owing to the additional information revealed from preferences among $m$ items instead of just $1$. We also propose two algorithms for the PAC problem with the TR feedback model with optimal (up to logarithmic factors) sample complexity guarantees, establishing the increase in statistical efficiency from exploiting rank-ordered feedback.}
}

Endnote

%0 Conference Paper
%T PAC Battling Bandits in the Plackett-Luce Model
%A Aadirupa Saha
%A Aditya Gopalan
%B Proceedings of the 30th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2019
%E Aurélien Garivier
%E Satyen Kale
%F pmlr-v98-saha19a
%I PMLR
%P 700--737
%U https://proceedings.mlr.press/v98/saha19a.html
%V 98
%X We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where, at each trial, the learner chooses a subset of $k$ arms from a fixed set of $n$ arms and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best-arm} from pairwise preference information and is known to require $O(\frac{n}{\epsilon^2} \ln \frac{1}{\delta})$ sample complexity. We study the sample complexity of this problem under two feedback models: (1) Winner of the subset (WI), and (2) Ranking of top-$m$ items (TR) for $2\le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \leq k \leq n$, the best achievable sample complexity is still $O\left( \frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$, and the same as that in the Dueling Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on sample complexity of $\Omega\bigg( \frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\bigg)$, which suggests a multiplicative reduction by a factor of $m$ owing to the additional information revealed from preferences among $m$ items instead of just $1$. We also propose two algorithms for the PAC problem with the TR feedback model with optimal (up to logarithmic factors) sample complexity guarantees, establishing the increase in statistical efficiency from exploiting rank-ordered feedback.

APA


Saha, A. & Gopalan, A. (2019). PAC Battling Bandits in the Plackett-Luce Model. Proceedings of the 30th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 98:700-737. Available from https://proceedings.mlr.press/v98/saha19a.html.
