Beating Bandits in Gradually Evolving Worlds
Proceedings of the 26th Annual Conference on Learning Theory, PMLR 30:210-227, 2013.
Consider the online convex optimization problem, in which a player has to choose actions iteratively and suffers corresponding losses according to some convex loss functions, and the goal is to minimize the regret. In the full-information setting, the player after choosing her action can observe the whole loss function in that round, while in the bandit setting, the only information the player can observe is the loss value of that action. Designing such bandit algorithms appears challenging, as the best regret currently achieved for general convex loss functions is much higher than that in the full-information setting, while for strongly convex loss functions, there is even a regret lower bound which is exponentially higher than that achieved in the full-information setting. To aim for smaller regrets, we adopt a relaxed two-point bandit setting in which the player can play two actions in each round and observe the loss values of those two actions. Moreover, we consider loss functions parameterized by their deviation D, which measures how fast they evolve, and we study how regrets depend on D. We show that two-point bandit algorithms can in fact achieve regrets matching those in the full-information setting in terms of D. More precisely, for convex loss functions, we achieve a regret of O(\sqrtD), while for strongly convex loss functions, we achieve a regret of O(\ln D), which is much smaller than the Ω(\sqrtD) lower bound in the traditional bandit setting.