Scale-Free Adversarial Multi Armed Bandits

Sudeep Raja Putta, Shipra Agrawal
Proceedings of The 33rd International Conference on Algorithmic Learning Theory, PMLR 167:910-930, 2022.

Abstract

We consider the Scale-Free Adversarial Multi Armed Bandits (MAB) problem. At the beginning of the game, the player only knows the number of arms $n$. It does not know the scale and magnitude of the losses chosen by the adversary or the number of rounds $T$. In each round, it sees bandit feedback about the loss vectors $l_1, \dots, l_T \in \mathbb{R}^n$. The goal is to bound its regret as a function of $n$ and the norms of $l_1, \dots, l_T$. We design a bandit Follow The Regularized Leader (FTRL) algorithm that uses an adaptive learning rate, and give two different regret bounds based on the exploration parameter used. With non-adaptive exploration, our algorithm has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nT})$, and with adaptive exploration it has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nL_1})$. Here $L_\infty = \sup_t \|l_t\|_\infty$, $L_2 = \sum_{t=1}^T \|l_t\|_2^2$, $L_1 = \sum_{t=1}^T \|l_t\|_1$, and the $\tilde{\mathcal{O}}$ notation suppresses logarithmic factors. These are the first MAB bounds that adapt to the $\|\cdot\|_2$ and $\|\cdot\|_1$ norms of the losses. The second bound is the first data-dependent scale-free MAB bound, as $T$ does not directly appear in the regret. We also develop a new technique for obtaining a rich class of local-norm lower bounds for Bregman divergences. This technique plays a crucial role in our analysis for controlling the regret when using importance-weighted estimators of unbounded losses, and could be of independent interest.
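As a rough illustration of the algorithmic template described in the abstract, the minimal sketch below shows a bandit FTRL loop with importance-weighted loss estimates, a learning rate that adapts to the observed (estimated) losses rather than to a known loss scale or horizon, and uniform exploration mixing. The function name `scale_free_bandit_ftrl`, the fixed exploration rate `gamma`, the exponential-weights (negative-entropy) update, and the particular learning-rate schedule are illustrative assumptions on our part, not the paper's exact algorithm, regularizer, or schedules.

```python
import numpy as np

def scale_free_bandit_ftrl(loss_matrix, gamma=0.01, seed=0):
    """Illustrative bandit FTRL loop: importance-weighted loss estimates,
    a data-dependent learning rate, and uniform exploration mixing.
    This is a generic Exp3-style sketch, not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    T, n = loss_matrix.shape
    cum_est = np.zeros(n)   # cumulative importance-weighted loss estimates
    sq_sum = 0.0            # running sum driving the adaptive learning rate
    total_loss = 0.0
    for t in range(T):
        # Learning rate shrinks as estimated losses accumulate, so no prior
        # knowledge of the loss scale or of the horizon T is required.
        eta = 1.0 / np.sqrt(1.0 + sq_sum)
        # FTRL step with a negative-entropy regularizer -> exponential weights
        # (shifting by the minimum only improves numerical stability).
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        # Mix in uniform exploration (a non-adaptive choice of gamma here).
        p = (1 - gamma) * p + gamma / n
        arm = rng.choice(n, p=p)
        loss = loss_matrix[t, arm]
        total_loss += loss
        # Importance-weighted estimator of the full loss vector.
        est = np.zeros(n)
        est[arm] = loss / p[arm]
        cum_est += est
        sq_sum += (loss / p[arm]) ** 2
    return total_loss
```

For instance, calling `scale_free_bandit_ftrl(np.random.standard_normal((1000, 5)))` runs the loop on losses whose sign and scale are not known to the player in advance.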

Cite this Paper


BibTeX
@InProceedings{pmlr-v167-putta22a,
  title     = {Scale-Free Adversarial Multi Armed Bandits},
  author    = {Putta, Sudeep Raja and Agrawal, Shipra},
  booktitle = {Proceedings of The 33rd International Conference on Algorithmic Learning Theory},
  pages     = {910--930},
  year      = {2022},
  editor    = {Dasgupta, Sanjoy and Haghtalab, Nika},
  volume    = {167},
  series    = {Proceedings of Machine Learning Research},
  month     = {29 Mar--01 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v167/putta22a/putta22a.pdf},
  url       = {https://proceedings.mlr.press/v167/putta22a.html},
  abstract  = {We consider the Scale-Free Adversarial Multi Armed Bandits (MAB) problem. At the beginning of the game, the player only knows the number of arms $n$. It does not know the scale and magnitude of the losses chosen by the adversary or the number of rounds $T$. In each round, it sees bandit feedback about the loss vectors $l_1, \dots, l_T \in \mathbb{R}^n$. The goal is to bound its regret as a function of $n$ and the norms of $l_1, \dots, l_T$. We design a bandit Follow The Regularized Leader (FTRL) algorithm that uses an adaptive learning rate, and give two different regret bounds based on the exploration parameter used. With non-adaptive exploration, our algorithm has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nT})$, and with adaptive exploration it has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nL_1})$. Here $L_\infty = \sup_t \|l_t\|_\infty$, $L_2 = \sum_{t=1}^T \|l_t\|_2^2$, $L_1 = \sum_{t=1}^T \|l_t\|_1$, and the $\tilde{\mathcal{O}}$ notation suppresses logarithmic factors. These are the first MAB bounds that adapt to the $\|\cdot\|_2$ and $\|\cdot\|_1$ norms of the losses. The second bound is the first data-dependent scale-free MAB bound, as $T$ does not directly appear in the regret. We also develop a new technique for obtaining a rich class of local-norm lower bounds for Bregman divergences. This technique plays a crucial role in our analysis for controlling the regret when using importance-weighted estimators of unbounded losses, and could be of independent interest.}
}
APA
Putta, S.R. & Agrawal, S.. (2022). Scale-Free Adversarial Multi Armed Bandits. Proceedings of The 33rd International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 167:910-930 Available from https://proceedings.mlr.press/v167/putta22a.html.
