Stochastic Multiarmed Bandits in Constant Space
[edit]
Proceedings of the TwentyFirst International Conference on Artificial Intelligence and Statistics, PMLR 84:386394, 2018.
Abstract
We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the winloss record for all $K$ arms. We give an algorithm using $O(1)$ words of space with regret $\sum_{i=1}^{K}\frac{1}{\Delta_i}\log \frac{\Delta_i}{∆}\log T$ where $\Delta_i$ is the gap between the best arm and arm $i$ and $∆$ is the gap between the best and the secondbest arms. If the rewards are bounded away from $0$ and $1$, this is within an $O(\log (1/∆))$ factor of the optimum regret possible without space constraints.
Related Material


