Stochastic Multi-armed Bandits in Constant Space
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84:386-394, 2018.
Abstract
We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all K arms. We give an algorithm using O(1) words of space with regret \sum_{i=1}^{K}\frac{1}{\Delta_i}\log \frac{\Delta_i}{\Delta}\log T where \Delta_i is the gap between the best arm and arm i and \Delta is the gap between the best and the second-best arms. If the rewards are bounded away from 0 and 1, this is within an O(\log (1/\Delta)) factor of the optimum regret possible without space constraints.
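To get a feel for the size of the stated expression, the following minimal sketch evaluates it literally for a hypothetical set of gaps and compares it with the classical gap-dependent expression \sum_i \frac{\log T}{\Delta_i}. Several things here are assumptions rather than part of the abstract: the function names, the sample gaps, the use of natural logarithms, the omission of all constants and lower-order terms, the exclusion of the best arm from the sum (its gap is zero, so its term is undefined), and the choice of \sum_i \frac{\log T}{\Delta_i} as a stand-in for the optimum regret without space constraints.

```python
import math


def constant_space_bound(gaps, horizon):
    """Evaluate sum_i (1/Delta_i) * log(Delta_i/Delta) * log(T) literally.

    gaps: gaps Delta_i of the non-best arms, each in (0, 1] (assumption).
    horizon: number of rounds T.
    Constants and lower-order terms are ignored.
    """
    delta = min(gaps)  # Delta: gap between the best and second-best arms
    return sum(math.log(g / delta) / g for g in gaps) * math.log(horizon)


def unconstrained_bound(gaps, horizon):
    """Classical gap-dependent expression sum_i log(T) / Delta_i (assumed baseline)."""
    return sum(1.0 / g for g in gaps) * math.log(horizon)


if __name__ == "__main__":
    sample_gaps = [0.1, 0.2, 0.4]  # hypothetical gaps, chosen for illustration
    T = 10_000
    c = constant_space_bound(sample_gaps, T)
    u = unconstrained_bound(sample_gaps, T)
    print(f"constant-space expression: {c:.2f}")
    print(f"unconstrained expression:  {u:.2f}")
    print(f"log(1/Delta):              {math.log(1 / min(sample_gaps)):.2f}")
```

Because every gap is at most 1, each term satisfies \log(\Delta_i/\Delta) \le \log(1/\Delta), so the first quantity never exceeds \log(1/\Delta) times the second. This matches the O(\log (1/\Delta)) factor quoted in the abstract, up to the constants ignored here.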