Beating Bandits in Gradually Evolving Worlds

Chao-Kai Chiang; Chia-Jung Lee; Chi-Jen Lu

Beating Bandits in Gradually Evolving Worlds

Chao-Kai Chiang, Chia-Jung Lee, Chi-Jen Lu

Proceedings of the 26th Annual Conference on Learning Theory, PMLR 30:210-227, 2013.

Abstract

Consider the online convex optimization problem, in which a player has to choose actions iteratively and suffers corresponding losses according to some convex loss functions, and the goal is to minimize the regret. In the full-information setting, the player after choosing her action can observe the whole loss function in that round, while in the bandit setting, the only information the player can observe is the loss value of that action. Designing such bandit algorithms appears challenging, as the best regret currently achieved for general convex loss functions is much higher than that in the full-information setting, while for strongly convex loss functions, there is even a regret lower bound which is exponentially higher than that achieved in the full-information setting. To aim for smaller regrets, we adopt a relaxed two-point bandit setting in which the player can play two actions in each round and observe the loss values of those two actions. Moreover, we consider loss functions parameterized by their deviation D, which measures how fast they evolve, and we study how regrets depend on D. We show that two-point bandit algorithms can in fact achieve regrets matching those in the full-information setting in terms of D. More precisely, for convex loss functions, we achieve a regret of O(\sqrtD), while for strongly convex loss functions, we achieve a regret of O(\ln D), which is much smaller than the Ω(\sqrtD) lower bound in the traditional bandit setting.

Cite this Paper

BibTeX


@InProceedings{pmlr-v30-Chiang13,
  title = 	 {Beating Bandits in Gradually Evolving Worlds},
  author = 	 {Chiang, Chao-Kai and Lee, Chia-Jung and Lu, Chi-Jen},
  booktitle = 	 {Proceedings of the 26th Annual Conference on Learning Theory},
  pages = 	 {210--227},
  year = 	 {2013},
  editor = 	 {Shalev-Shwartz, Shai and Steinwart, Ingo},
  volume = 	 {30},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Princeton, NJ, USA},
  month = 	 {12--14 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v30/Chiang13.pdf},
  url = 	 {https://proceedings.mlr.press/v30/Chiang13.html},
  abstract = 	 {Consider the online convex optimization problem, in which a player has to choose actions iteratively and suffers corresponding losses according to some convex loss functions, and the goal is to minimize the regret. In the full-information setting, the player after choosing her action can observe the whole loss function in that round, while in the bandit setting, the only information the player can observe is the loss value of that action. Designing such bandit algorithms appears challenging, as the best regret currently achieved for general convex loss functions is much higher than that in the full-information setting, while for strongly convex loss functions, there is even a regret lower bound which is exponentially higher than that achieved in the full-information setting. To aim for smaller regrets, we adopt a relaxed two-point bandit setting in which the player can play two actions in each round and observe the loss values of those two actions. Moreover, we consider loss functions parameterized by their deviation D, which measures how fast they evolve, and we study how regrets depend on D. We show that two-point bandit algorithms can in fact achieve regrets matching those in the full-information setting in terms of D. More precisely, for convex loss functions, we achieve a regret of O(\sqrtD), while for strongly convex loss functions, we achieve a regret of O(\ln D), which is much smaller than the Ω(\sqrtD) lower bound in the traditional bandit setting.}
}

Endnote

%0 Conference Paper
%T Beating Bandits in Gradually Evolving Worlds
%A Chao-Kai Chiang
%A Chia-Jung Lee
%A Chi-Jen Lu
%B Proceedings of the 26th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2013
%E Shai Shalev-Shwartz
%E Ingo Steinwart	
%F pmlr-v30-Chiang13
%I PMLR
%P 210--227
%U https://proceedings.mlr.press/v30/Chiang13.html
%V 30
%X Consider the online convex optimization problem, in which a player has to choose actions iteratively and suffers corresponding losses according to some convex loss functions, and the goal is to minimize the regret. In the full-information setting, the player after choosing her action can observe the whole loss function in that round, while in the bandit setting, the only information the player can observe is the loss value of that action. Designing such bandit algorithms appears challenging, as the best regret currently achieved for general convex loss functions is much higher than that in the full-information setting, while for strongly convex loss functions, there is even a regret lower bound which is exponentially higher than that achieved in the full-information setting. To aim for smaller regrets, we adopt a relaxed two-point bandit setting in which the player can play two actions in each round and observe the loss values of those two actions. Moreover, we consider loss functions parameterized by their deviation D, which measures how fast they evolve, and we study how regrets depend on D. We show that two-point bandit algorithms can in fact achieve regrets matching those in the full-information setting in terms of D. More precisely, for convex loss functions, we achieve a regret of O(\sqrtD), while for strongly convex loss functions, we achieve a regret of O(\ln D), which is much smaller than the Ω(\sqrtD) lower bound in the traditional bandit setting.

RIS


TY  - CPAPER
TI  - Beating Bandits in Gradually Evolving Worlds
AU  - Chao-Kai Chiang
AU  - Chia-Jung Lee
AU  - Chi-Jen Lu
BT  - Proceedings of the 26th Annual Conference on Learning Theory
DA  - 2013/06/13
ED  - Shai Shalev-Shwartz
ED  - Ingo Steinwart	
ID  - pmlr-v30-Chiang13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 30
SP  - 210
EP  - 227
L1  - http://proceedings.mlr.press/v30/Chiang13.pdf
UR  - https://proceedings.mlr.press/v30/Chiang13.html
AB  - Consider the online convex optimization problem, in which a player has to choose actions iteratively and suffers corresponding losses according to some convex loss functions, and the goal is to minimize the regret. In the full-information setting, the player after choosing her action can observe the whole loss function in that round, while in the bandit setting, the only information the player can observe is the loss value of that action. Designing such bandit algorithms appears challenging, as the best regret currently achieved for general convex loss functions is much higher than that in the full-information setting, while for strongly convex loss functions, there is even a regret lower bound which is exponentially higher than that achieved in the full-information setting. To aim for smaller regrets, we adopt a relaxed two-point bandit setting in which the player can play two actions in each round and observe the loss values of those two actions. Moreover, we consider loss functions parameterized by their deviation D, which measures how fast they evolve, and we study how regrets depend on D. We show that two-point bandit algorithms can in fact achieve regrets matching those in the full-information setting in terms of D. More precisely, for convex loss functions, we achieve a regret of O(\sqrtD), while for strongly convex loss functions, we achieve a regret of O(\ln D), which is much smaller than the Ω(\sqrtD) lower bound in the traditional bandit setting.
ER  -

APA


Chiang, C., Lee, C. & Lu, C.. (2013). Beating Bandits in Gradually Evolving Worlds. Proceedings of the 26th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 30:210-227 Available from https://proceedings.mlr.press/v30/Chiang13.html.

Related Material

Download PDF