Follow-the-Perturbed-Leader Achieves Best-of-Both-Worlds for Bandit Problems

Junya Honda, Shinji Ito, Taira Tsuchiya
Proceedings of The 34th International Conference on Algorithmic Learning Theory, PMLR 201:726-754, 2023.

Abstract

This paper discusses the adversarial and stochastic $K$-armed bandit problems. In the adversarial setting, the best possible regret is known to be $O(\sqrt{KT})$ for time horizon $T$. This bound can be achieved by several policies, but they require explicitly computing the arm-selection probabilities by solving optimization problems at each round, which becomes problematic in some settings. One promising candidate for avoiding this issue is the Follow-The-Perturbed-Leader (FTPL) policy, which simply chooses the arm with the minimum randomly perturbed cumulative estimated loss. In particular, it has been conjectured that $O(\sqrt{KT})$ regret might be achieved by FTPL with a Fréchet-type perturbation. This paper affirmatively resolves this conjecture by showing that the Fréchet perturbation indeed achieves this bound. We also show that FTPL achieves logarithmic regret in the stochastic setting, meaning that FTPL achieves best-of-both-worlds regret bounds. The key to these bounds is a novel technique for evaluating the stability of the policy and for expressing the regret at each round in multiple forms depending on the estimated losses.
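
For intuition, the following is a minimal Python sketch of the FTPL arm-selection rule described in the abstract, using a Fréchet-type perturbation with shape 2. The learning-rate schedule, the geometric-resampling cap, the random seed, and names such as `ftpl_bandit` and `loss_fn` are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def frechet(size, shape=2.0):
    # Fréchet(shape) samples via inverse transform: F(x) = exp(-x^(-shape)), x > 0.
    u = rng.random(size)
    return (-np.log(u)) ** (-1.0 / shape)

def ftpl_bandit(loss_fn, K, T):
    """FTPL sketch: pull the arm minimizing the perturbed cumulative estimated loss.

    loss_fn(t, arm) returns the loss in [0, 1] of the pulled arm at round t.
    """
    L_hat = np.zeros(K)             # cumulative estimated losses
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)      # assumed O(1/sqrt(t)) learning rate
        arm = int(np.argmin(eta * L_hat - frechet(K)))
        loss = loss_fn(t, arm)

        # Importance weighting without computing arm probabilities explicitly:
        # geometric resampling (redraw perturbations until `arm` wins again)
        # estimates 1 / P(choosing arm); the cap keeps the loop bounded.
        m, cap = 1, 10_000
        while m < cap and int(np.argmin(eta * L_hat - frechet(K))) != arm:
            m += 1
        L_hat[arm] += m * loss
    return L_hat
```

Note that the argmin over the perturbed losses never requires the arm-selection probabilities in closed form, which is exactly the computational advantage of FTPL highlighted above.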

Cite this Paper


BibTeX
@InProceedings{pmlr-v201-honda23a,
  title     = {{Follow-the-Perturbed-Leader Achieves Best-of-Both-Worlds for Bandit Problems}},
  author    = {Honda, Junya and Ito, Shinji and Tsuchiya, Taira},
  booktitle = {Proceedings of The 34th International Conference on Algorithmic Learning Theory},
  pages     = {726--754},
  year      = {2023},
  editor    = {Agrawal, Shipra and Orabona, Francesco},
  volume    = {201},
  series    = {Proceedings of Machine Learning Research},
  month     = {20 Feb--23 Feb},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v201/honda23a/honda23a.pdf},
  url       = {https://proceedings.mlr.press/v201/honda23a.html},
  abstract  = {This paper discusses the adversarial and stochastic $K$-armed bandit problems. In the adversarial setting, the best possible regret is known to be $O(\sqrt{KT})$ for time horizon $T$. This bound can be achieved by several policies but they require to explicitly compute the arm-selection probabilities by solving optimization problems at each round, which becomes problematic in some settings. One promising candidate to avoid this issue is the Follow-The-Perturbed-Leader (FTPL) policy, which simply chooses the arm with the minimum cumulative estimated loss with a random perturbation. In particular, it has been conjectured that $O(\sqrt{KT})$ regret might be achieved by FTPL with a Fréchet-type perturbation. This paper affirmatively resolves this conjecture by showing that Fréchet perturbation indeed achieves this bound. We also show that FTPL achieves a logarithmic regret for the stochastic setting, meaning that FTPL achieves best-of-both-worlds regret bounds. The key to these bounds is the novel technique to evaluate the stability of the policy and to express the regret at each round in multiple forms depending on the estimated losses.}
}
Endnote
%0 Conference Paper
%T Follow-the-Perturbed-Leader Achieves Best-of-Both-Worlds for Bandit Problems
%A Junya Honda
%A Shinji Ito
%A Taira Tsuchiya
%B Proceedings of The 34th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Shipra Agrawal
%E Francesco Orabona
%F pmlr-v201-honda23a
%I PMLR
%P 726--754
%U https://proceedings.mlr.press/v201/honda23a.html
%V 201
%X This paper discusses the adversarial and stochastic $K$-armed bandit problems. In the adversarial setting, the best possible regret is known to be $O(\sqrt{KT})$ for time horizon $T$. This bound can be achieved by several policies but they require to explicitly compute the arm-selection probabilities by solving optimization problems at each round, which becomes problematic in some settings. One promising candidate to avoid this issue is the Follow-The-Perturbed-Leader (FTPL) policy, which simply chooses the arm with the minimum cumulative estimated loss with a random perturbation. In particular, it has been conjectured that $O(\sqrt{KT})$ regret might be achieved by FTPL with a Fréchet-type perturbation. This paper affirmatively resolves this conjecture by showing that Fréchet perturbation indeed achieves this bound. We also show that FTPL achieves a logarithmic regret for the stochastic setting, meaning that FTPL achieves best-of-both-worlds regret bounds. The key to these bounds is the novel technique to evaluate the stability of the policy and to express the regret at each round in multiple forms depending on the estimated losses.
APA
Honda, J., Ito, S. & Tsuchiya, T. (2023). Follow-the-Perturbed-Leader Achieves Best-of-Both-Worlds for Bandit Problems. Proceedings of The 34th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 201:726-754. Available from https://proceedings.mlr.press/v201/honda23a.html.
