Scale-Free Adversarial Multi Armed Bandits
Proceedings of The 33rd International Conference on Algorithmic Learning Theory, PMLR 167:910-930, 2022.
Abstract
We consider the Scale-Free Adversarial Multi Armed Bandit (MAB) problem. At the beginning of the game, the player only knows the number of arms $n$; it knows neither the scale and magnitude of the losses chosen by the adversary nor the number of rounds $T$. In each round, it sees bandit feedback about the loss vectors $l_1,\dots,l_T \in \mathbb{R}^n$. The goal is to bound its regret as a function of $n$ and the norms of $l_1,\dots,l_T$. We design a bandit Follow The Regularized Leader (FTRL) algorithm that uses an adaptive learning rate, and we give two different regret bounds depending on the exploration parameter used. With non-adaptive exploration, our algorithm has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nT})$, and with adaptive exploration, it has a regret of $\tilde{\mathcal{O}}(\sqrt{nL_2} + L_\infty\sqrt{nL_1})$. Here $L_\infty = \sup_t \|l_t\|_\infty$, $L_2 = \sum_{t=1}^T \|l_t\|_2^2$, $L_1 = \sum_{t=1}^T \|l_t\|_1$, and the $\tilde{\mathcal{O}}$ notation suppresses logarithmic factors. These are the first MAB bounds that adapt to the $\|\cdot\|_2$ and $\|\cdot\|_1$ norms of the losses. The second bound is the first data-dependent scale-free MAB bound, as $T$ does not directly appear in the regret. We also develop a new technique for obtaining a rich class of local-norm lower bounds for Bregman divergences. This technique plays a crucial role in our analysis for controlling the regret when using importance-weighted estimators of unbounded losses, and it may be of independent interest.
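To make the ingredients mentioned in the abstract concrete, the following is a minimal, hypothetical sketch of a bandit FTRL loop combining an adaptive learning rate, uniform-exploration mixing, and importance-weighted loss estimates. The negative-entropy regularizer, the particular learning-rate schedule, and the fixed exploration parameter `gamma` below are illustrative assumptions for exposition; they are not the exact choices made in the paper.

```python
import numpy as np

def scale_free_bandit_ftrl(loss_matrix, gamma=0.01, seed=0):
    """Sketch of bandit FTRL with an adaptive learning rate (assumed schedule),
    uniform exploration mixing, and importance-weighted loss estimates.
    loss_matrix has shape (T, n); only the played arm's loss is revealed each round."""
    rng = np.random.default_rng(seed)
    T, n = loss_matrix.shape
    cum_est = np.zeros(n)      # cumulative importance-weighted loss estimates
    sq_sum = 1.0               # running sum driving the adaptive learning rate
    total_loss = 0.0
    best_arm_loss = loss_matrix.sum(axis=0).min()
    for t in range(T):
        eta = 1.0 / np.sqrt(sq_sum)                    # adaptive learning rate (assumption)
        w = np.exp(-eta * (cum_est - cum_est.min()))   # FTRL with negative-entropy regularizer
        p = w / w.sum()
        p = (1 - gamma) * p + gamma / n                # mix in uniform exploration
        arm = rng.choice(n, p=p)
        loss = loss_matrix[t, arm]                     # bandit feedback: only this entry is seen
        total_loss += loss
        est = np.zeros(n)
        est[arm] = loss / p[arm]                       # importance-weighted estimator
        cum_est += est
        sq_sum += loss ** 2                            # no prior bound on the loss scale is assumed
    return total_loss - best_arm_loss                  # realized regret vs. best fixed arm

# Toy usage: adversary-chosen losses with an unknown scale.
losses = 5.0 * np.random.default_rng(1).standard_normal((2000, 4))
print(scale_free_bandit_ftrl(losses))
```

The sketch only illustrates why importance weighting interacts badly with unbounded losses (the estimates `loss / p[arm]` can be very large when an arm has small probability), which is the difficulty the paper's local-norm lower bounds for Bregman divergences are used to control.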