An Optimal Algorithm for Stochastic and Adversarial Bandits

Julian Zimmert, Yevgeny Seldin
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:467-475, 2019.

Abstract

We derive an algorithm that achieves the optimal (up to constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent with Tsallis entropy regularizer. We provide a complete characterization of such algorithms and show that Tsallis entropy with power $\alpha = 1/2$ achieves the goal. In addition, the proposed algorithm enjoys improved regret guarantees in two intermediate regimes: the moderately contaminated stochastic regime defined by Seldin and Slivkins [22] and the stochastically constrained adversary studied by Wei and Luo [26]. The algorithm also obtains adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide an empirical evaluation of the algorithm, demonstrating that it outperforms Ucb1 and Exp3 in stochastic environments. In certain adversarial regimes the algorithm significantly outperforms Ucb1 and Thompson Sampling, which exhibit almost linear regret.
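The update the abstract describes (online mirror descent with the 1/2-Tsallis entropy regularizer, often called Tsallis-INF) admits a closed-form step: the play distribution satisfies w_i = 4 / (eta_t * (L_i - x))^2, where L is the vector of cumulative loss estimates and x is a normalizing constant. Below is a minimal Python sketch, not code from the paper: the 1/sqrt(t) learning-rate schedule and importance-weighted loss estimates follow the abstract's setting, while the binary search for x (the paper uses Newton's method), the constants, and the loss_fn environment callback are illustrative assumptions.

import numpy as np

def omd_step(L, eta, iters=60):
    """One mirror-descent step for the 1/2-Tsallis regularizer.

    Solves w_i = 4 / (eta * (L_i - x))^2 with x chosen by binary search
    so that the weights sum to one. At x = min(L) - 2/eta the sum is
    >= 1, and at x = min(L) - 2*sqrt(K)/eta it is <= 1, which brackets
    the root.
    """
    K = len(L)
    lo = L.min() - 2.0 * np.sqrt(K) / eta   # sum(w) <= 1 here
    hi = L.min() - 2.0 / eta                # sum(w) >= 1 here
    for _ in range(iters):
        x = 0.5 * (lo + hi)
        if np.sum(4.0 / (eta * (L - x)) ** 2) > 1.0:
            hi = x
        else:
            lo = x
    w = 4.0 / (eta * (L - 0.5 * (lo + hi))) ** 2
    return w / w.sum()                      # exact renormalization

def tsallis_inf(loss_fn, K, T, seed=0):
    """Sketch of the anytime algorithm: eta_t = 1/sqrt(t), sample from
    the OMD distribution, update cumulative importance-weighted loss
    estimates. loss_fn(t, arm) -> loss in [0, 1] is a hypothetical
    environment interface."""
    rng = rng_play = np.random.default_rng(seed)
    L_hat = np.zeros(K)                     # cumulative loss estimates
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)
        w = omd_step(L_hat, eta)
        arm = rng_play.choice(K, p=w)
        loss = loss_fn(t, arm)
        L_hat[arm] += loss / w[arm]         # importance-weighted estimator
    return L_hat

# Example usage: Bernoulli losses with one better arm (hypothetical setup).
means = np.array([0.5, 0.5, 0.4])
rng_env = np.random.default_rng(1)
tsallis_inf(lambda t, a: float(rng_env.random() < means[a]), K=3, T=10000)

The same sampling distribution is used in both regimes; no detection of stochastic versus adversarial behavior is needed, which is what makes the "best of both worlds" guarantee possible without prior knowledge.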

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-zimmert19a,
  title     = {An Optimal Algorithm for Stochastic and Adversarial Bandits},
  author    = {Zimmert, Julian and Seldin, Yevgeny},
  booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages     = {467--475},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume    = {89},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v89/zimmert19a/zimmert19a.pdf},
  url       = {https://proceedings.mlr.press/v89/zimmert19a.html}
}
Endnote
%0 Conference Paper
%T An Optimal Algorithm for Stochastic and Adversarial Bandits
%A Julian Zimmert
%A Yevgeny Seldin
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-zimmert19a
%I PMLR
%P 467--475
%U https://proceedings.mlr.press/v89/zimmert19a.html
%V 89
APA
Zimmert, J. & Seldin, Y. (2019). An Optimal Algorithm for Stochastic and Adversarial Bandits. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:467-475. Available from https://proceedings.mlr.press/v89/zimmert19a.html.