PAC Battling Bandits in the Plackett-Luce Model

Aadirupa Saha, Aditya Gopalan
Proceedings of the 30th International Conference on Algorithmic Learning Theory, PMLR 98:700-737, 2019.

Abstract

We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best arm} from pairwise preference information and is known to require $O\left(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$ sample complexity. We study the sample complexity of this problem under two feedback models: (1) winner of the subset (WI), and (2) ranking of the top $m$ items (TR) for $2 \le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \le k \le n$, the best achievable sample complexity is still $O\left(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$ and the same as in the Dueling-Bandit setting ($k = 2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on the sample complexity of $\Omega\left(\frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\right)$, which suggests a multiplicative reduction by a factor of $m$ owing to the additional information revealed by preferences among $m$ items instead of just one. We also propose two algorithms for the PAC problem with TR feedback, with sample complexity guarantees that are optimal up to logarithmic factors, establishing the gain in statistical efficiency from exploiting rank-ordered feedback.
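To make the two feedback models concrete, here is a minimal simulation sketch (an illustration, not code from the paper; the function name pl_feedback and the example scores are hypothetical). Under a PL model with positive scores $\theta_1, \dots, \theta_n$, the winner of a queried subset $S$ is item $i$ with probability $\theta_i / \sum_{j \in S} \theta_j$, and a top-$m$ ranking is drawn by repeating this winner draw without replacement; $m = 1$ recovers WI feedback.

import numpy as np

def pl_feedback(theta, subset, m=1, rng=None):
    """Simulate top-m ranking (TR) feedback from a Plackett-Luce model.
    theta: positive PL score per arm (hypothetical parameters); subset: the
    k arms queried this round; m=1 yields winner-information (WI) feedback."""
    rng = np.random.default_rng() if rng is None else rng
    remaining = list(subset)
    ranking = []
    for _ in range(m):
        # Winner among remaining items: P(i) = theta[i] / sum of remaining scores.
        weights = np.array([theta[i] for i in remaining], dtype=float)
        pick = rng.choice(len(remaining), p=weights / weights.sum())
        ranking.append(remaining.pop(pick))
    return ranking  # most preferred first

# Example: arm 0 has the highest score, so it tops the ranking most often.
theta = [0.9, 0.5, 0.4, 0.2, 0.1]
print(pl_feedback(theta, subset=[0, 2, 3, 4], m=2))

Sequential sampling without replacement is exactly the PL ranking distribution, which gives some intuition for the paper's factor-$m$ saving: each TR query yields roughly $m$ winner draws' worth of information instead of one.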

Cite this Paper


BibTeX
@InProceedings{pmlr-v98-saha19a,
  title     = {PAC Battling Bandits in the Plackett-Luce Model},
  author    = {Saha, Aadirupa and Gopalan, Aditya},
  booktitle = {Proceedings of the 30th International Conference on Algorithmic Learning Theory},
  pages     = {700--737},
  year      = {2019},
  editor    = {Garivier, Aurélien and Kale, Satyen},
  volume    = {98},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--24 Mar},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v98/saha19a/saha19a.pdf},
  url       = {https://proceedings.mlr.press/v98/saha19a.html},
  abstract  = {We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best arm} from pairwise preference information and is known to require $O(\frac{n}{\epsilon^2} \ln \frac{1}{\delta})$ sample complexity. We study the sample complexity of this problem under two feedback models: (1) winner of the subset (WI), and (2) ranking of the top $m$ items (TR) for $2\le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \leq k \leq n$, the best achievable sample complexity is still $O\left( \frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$ and the same as in the Dueling-Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on the sample complexity of $\Omega\bigg( \frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\bigg)$, which suggests a multiplicative reduction by a factor of ${m}$ owing to the additional information revealed by preferences among $m$ items instead of just one. We also propose two algorithms for the PAC problem with TR feedback, with sample complexity guarantees that are optimal up to logarithmic factors, establishing the gain in statistical efficiency from exploiting rank-ordered feedback.}
}
Endnote
%0 Conference Paper
%T PAC Battling Bandits in the Plackett-Luce Model
%A Aadirupa Saha
%A Aditya Gopalan
%B Proceedings of the 30th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2019
%E Aurélien Garivier
%E Satyen Kale
%F pmlr-v98-saha19a
%I PMLR
%P 700--737
%U https://proceedings.mlr.press/v98/saha19a.html
%V 98
%X We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best arm} from pairwise preference information and is known to require $O(\frac{n}{\epsilon^2} \ln \frac{1}{\delta})$ sample complexity. We study the sample complexity of this problem under two feedback models: (1) winner of the subset (WI), and (2) ranking of the top $m$ items (TR) for $2\le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \leq k \leq n$, the best achievable sample complexity is still $O\left( \frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$ and the same as in the Dueling-Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on the sample complexity of $\Omega\bigg( \frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\bigg)$, which suggests a multiplicative reduction by a factor of ${m}$ owing to the additional information revealed by preferences among $m$ items instead of just one. We also propose two algorithms for the PAC problem with TR feedback, with sample complexity guarantees that are optimal up to logarithmic factors, establishing the gain in statistical efficiency from exploiting rank-ordered feedback.
APA
Saha, A. & Gopalan, A. (2019). PAC Battling Bandits in the Plackett-Luce Model. Proceedings of the 30th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 98:700-737. Available from https://proceedings.mlr.press/v98/saha19a.html.
