Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits

Haipeng Luo, Mengxiao Zhang, Peng Zhao, Zhi-Hua Zhou
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:3635-3684, 2022.

Abstract

We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The Corral algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\tilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is $\mbox{poly}(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\tilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.
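
For concreteness, the switching regret in question can be written as follows (standard notation, our paraphrase rather than the paper's exact display): for loss vectors $\ell_1, \dots, \ell_T$, played actions $x_1, \dots, x_T$, and a comparator sequence $u_1, \dots, u_T$ in the domain with at most $S$ switches,

$$\mathrm{Reg}_T = \sum_{t=1}^{T} \langle \ell_t, x_t - u_t \rangle, \qquad \text{where} \quad \sum_{t=2}^{T} \mathbb{1}\{u_t \neq u_{t-1}\} \le S,$$

and the paper's main result bounds the expected regret by $\tilde{O}(\sqrt{dST})$.

To make the corralling template itself concrete, below is a minimal Python sketch. This is our illustration, not the paper's pseudocode: the master here is plain exponential weights, whereas Corral and the paper's recipe use more careful updates (e.g., log-barrier online mirror descent) to get their guarantees, and the base-algorithm interface (`act`, `update`) is an assumed placeholder. The sketch conveys only the interface: a master samples among the currently active base algorithms, plays the chosen base's action, and passes back importance-weighted bandit feedback.

    import numpy as np

    class CorralSketch:
        def __init__(self, bases, start_times, eta=0.1, rng=None):
            self.bases = bases                # list of base bandit algorithms
            self.start = start_times          # base i becomes active at round start[i]
            self.eta = eta                    # master learning rate
            self.logw = np.zeros(len(bases))  # log-weights of the master
            self.rng = rng or np.random.default_rng()

        def step(self, t, get_loss):
            # Restrict the master to bases that have already started; the
            # paper's construction runs T bases, one starting per time step.
            active = [i for i, s in enumerate(self.start) if s <= t]
            logw = self.logw[active]
            p = np.exp(logw - logw.max())     # stable softmax over active bases
            p /= p.sum()
            k = self.rng.choice(len(active), p=p)   # sample an active base
            i = active[k]
            x = self.bases[i].act(t)                # play the chosen base's action
            loss = get_loss(x)                      # bandit feedback for x only
            self.logw[i] -= self.eta * loss / p[k]  # importance-weighted update
            self.bases[i].update(x, loss)           # forward feedback to the base
            return x, loss

With `start_times = [0, 1, ..., T-1]` and $T$ copies of a linear-bandit base algorithm, this mirrors the high-level construction described in the abstract: after the comparator switches, some base that started near the switch point is well positioned to track it. Note that this naive master would pay a $\mathrm{poly}(M)$ overhead; achieving the logarithmic dependence on $M$ is exactly what the paper's conditions and recipe are for.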

Cite this Paper


BibTeX
@InProceedings{pmlr-v178-luo22b,
  title     = {Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits},
  author    = {Luo, Haipeng and Zhang, Mengxiao and Zhao, Peng and Zhou, Zhi-Hua},
  booktitle = {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages     = {3635--3684},
  year      = {2022},
  editor    = {Loh, Po-Ling and Raginsky, Maxim},
  volume    = {178},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--05 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v178/luo22b/luo22b.pdf},
  url       = {https://proceedings.mlr.press/v178/luo22b.html},
  abstract  = {We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The Corral algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\tilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is $\mbox{poly}(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\tilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.}
}
Endnote
%0 Conference Paper
%T Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits
%A Haipeng Luo
%A Mengxiao Zhang
%A Peng Zhao
%A Zhi-Hua Zhou
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-luo22b
%I PMLR
%P 3635--3684
%U https://proceedings.mlr.press/v178/luo22b.html
%V 178
%X We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The Corral algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\tilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is $\mbox{poly}(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\tilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.
APA
Luo, H., Zhang, M., Zhao, P. & Zhou, Z. (2022). Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:3635-3684. Available from https://proceedings.mlr.press/v178/luo22b.html.
