[edit]
A simple multi-armed bandit algorithm with optimal variation-bounded regret
Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:817-820, 2011.
Abstract
We pose the question of whether it is possible to design a simple, linear-time algorithm for the basic multi-armed bandit problem in the adversarial setting which has a regret bound of $O(\sqrt{Q \log T})$, where $Q$ is the total quadratic variation of all the arms.