More Adaptive Algorithms for Adversarial Bandits
Proceedings of the 31st Conference On Learning Theory, PMLR 75:1263-1291, 2018.
Abstract
We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss and logarithmic regret when there exists an arm whose expected loss is always smaller than those of other arms by a fixed gap (e.g. the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on using a sophisticated increasing learning rate schedule.
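To make the optimistic OMD-with-log-barrier idea concrete, here is a minimal Python sketch. It is not the paper's exact algorithm: it uses a fixed learning rate eta, omits the correction terms and the increasing learning rate schedule mentioned above, and assumes the optimistic prediction is simply each arm's most recently observed loss; the function names log_barrier_omd_step and optimistic_bandit_omd are illustrative. The log-barrier update over the simplex reduces to choosing a scalar Lagrange multiplier so that the new weights sum to one, which the sketch finds by bisection.

```python
import numpy as np

def log_barrier_omd_step(w, grad, eta):
    """One mirror-descent step over the simplex with the log-barrier regularizer
    psi(w) = (1/eta) * sum_i -log(w_i).

    The optimality condition gives 1/w_new_i = 1/w_i + eta * (grad_i - lam),
    where the Lagrange multiplier lam is chosen so the new weights sum to one;
    since that sum is monotone in lam, we find it by bisection.
    """
    def total_mass(lam):
        return np.sum(1.0 / (1.0 / w + eta * (grad - lam)))

    hi = np.min(grad + 1.0 / (eta * w)) - 1e-12   # keeps every denominator positive
    lo = hi - 1.0
    while total_mass(lo) > 1.0:                   # expand the bracket until mass < 1
        lo -= 2.0 * (hi - lo)
    for _ in range(100):                          # bisection on lam
        mid = 0.5 * (lo + hi)
        if total_mass(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    w_new = 1.0 / (1.0 / w + eta * (grad - 0.5 * (lo + hi)))
    return w_new / w_new.sum()                    # renormalize for numerical safety

def optimistic_bandit_omd(loss_matrix, eta=0.05, seed=0):
    """Optimistic OMD for the adversarial multi-armed bandit (illustrative sketch).

    loss_matrix: T x K array of losses in [0, 1]; only the chosen arm's loss is
    observed each round.  The optimistic prediction m is each arm's most recently
    observed loss, and the loss estimator is the variance-reduced
    importance-weighted estimate built around that prediction.
    """
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    w_aux = np.full(K, 1.0 / K)   # auxiliary iterate of optimistic OMD
    m = np.zeros(K)               # optimistic prediction of the next loss vector
    total_loss = 0.0
    for t in range(T):
        w = log_barrier_omd_step(w_aux, m, eta)      # play with the prediction folded in
        arm = rng.choice(K, p=w)
        loss = loss_matrix[t, arm]
        total_loss += loss
        ell_hat = m.copy()
        ell_hat[arm] += (loss - m[arm]) / w[arm]     # variance-reduced loss estimate
        w_aux = log_barrier_omd_step(w_aux, ell_hat, eta)
        m[arm] = loss                                # update this arm's prediction
    return total_loss
```

As a quick usage example, optimistic_bandit_omd(np.random.rand(10000, 5)) runs the sketch on random losses with five arms; the hypothetical choice of prediction (last observed loss per arm) is what a path-length-type bound would exploit, while the paper's actual instantiations differ in their predictions, correction terms, and learning rate schedules.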