Tsallis-INF for Decoupled Exploration and Exploitation in Multi-armed Bandits

Chloé Rouyer; Yevgeny Seldin

Tsallis-INF for Decoupled Exploration and Exploitation in Multi-armed Bandits

Chloé Rouyer, Yevgeny Seldin

Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:3227-3249, 2020.

Abstract

We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best of both worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal

$O(\sqrt{KT})$ regret bound, slightly improving on the result of Avner et al.. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al.. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).

Cite this Paper

BibTeX


@InProceedings{pmlr-v125-rouyer20a,
  title = 	 {Tsallis-INF for Decoupled Exploration and Exploitation in Multi-armed Bandits},
  author =       {Rouyer, Chlo{\'e} and Seldin, Yevgeny},
  booktitle = 	 {Proceedings of Thirty Third Conference on Learning Theory},
  pages = 	 {3227--3249},
  year = 	 {2020},
  editor = 	 {Abernethy, Jacob and Agarwal, Shivani},
  volume = 	 {125},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--12 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v125/rouyer20a/rouyer20a.pdf},
  url = 	 {https://proceedings.mlr.press/v125/rouyer20a.html},
  abstract = 	 { We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best of both worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal $O(\sqrt{KT})$ regret bound, slightly improving on the result of Avner et al.. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al.. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).}
}

Endnote

%0 Conference Paper
%T Tsallis-INF for Decoupled Exploration and Exploitation in Multi-armed Bandits
%A Chloé Rouyer
%A Yevgeny Seldin
%B Proceedings of Thirty Third Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2020
%E Jacob Abernethy
%E Shivani Agarwal	
%F pmlr-v125-rouyer20a
%I PMLR
%P 3227--3249
%U https://proceedings.mlr.press/v125/rouyer20a.html
%V 125
%X  We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best of both worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal $O(\sqrt{KT})$ regret bound, slightly improving on the result of Avner et al.. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al.. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).

APA


Rouyer, C. & Seldin, Y.. (2020). Tsallis-INF for Decoupled Exploration and Exploitation in Multi-armed Bandits. Proceedings of Thirty Third Conference on Learning Theory, in Proceedings of Machine Learning Research 125:3227-3249 Available from https://proceedings.mlr.press/v125/rouyer20a.html.

Related Material

Download PDF