A simple multiarmed bandit algorithm with optimal variationbounded regret
[edit]
Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:817820, 2011.
Abstract
We pose the question of whether it is possible to design a simple, lineartime algorithm for the basic multiarmed bandit problem in the adversarial setting which has a regret bound of O(\sqrtQ \log T), where Q is the total quadratic variation of all the arms.
Related Material



