Parameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret Bounds

Shinji Ito
Proceedings of Thirty Fourth Conference on Learning Theory, PMLR 134:2552-2583, 2021.

Abstract

This paper presents multi-armed bandit (MAB) algorithms that work well in adversarial environments and that offer improved performance by exploiting inherent structures in such environments, such as stochastic generative models, as well as small variations in loss vectors. The fundamental aim of this work is to overcome the limitation of worst-case analyses in MAB contexts. Two basic approaches to achieving this purpose can be found in the literature: best-of-both-worlds algorithms that work well in both stochastic and adversarial settings, and data-dependent regret bounds that improve with certain difficulty indicators of the loss sequences. One remarkable study taking the best-of-both-worlds approach is that of the Tsallis-INF algorithm \citep{zimmert2019optimal}, which achieves nearly optimal regret bounds, up to small constants, in both settings, though such bounds have remained unproven for the special case of a stochastic setting with multiple optimal arms. This paper offers two particular contributions: (i) We show that the Tsallis-INF algorithm enjoys a regret bound of logarithmic order in the number of rounds for stochastic environments, even if the best arm is not unique. (ii) We provide a new algorithm with a new \textit{hybrid} regret bound that implies logarithmic regret in the stochastic regime and multiple data-dependent regret bounds in the adversarial regime, including bounds dependent on the cumulative loss, the total variation, and the path-length of the loss sequence. Both our proposed algorithm and the Tsallis-INF algorithm are based on the follow-the-regularized-leader (FTRL) framework with a time-varying regularizer. The analyses in this paper rely on the \textit{skewed Bregman divergence}, which provides simple expressions of regret bounds for FTRL with a time-varying regularizer.
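
For concreteness, a minimal sketch of the kind of FTRL update underlying Tsallis-INF is given below, assuming $K$ arms, the $1/2$-Tsallis-entropy regularizer (up to scaling), a generic time-varying learning rate $\eta_t > 0$, and the standard importance-weighted loss estimator; the precise regularizer weights, learning-rate schedule, and loss estimator used in the paper and in \citep{zimmert2019optimal} may differ from this simplified form.

\[
  p_t \in \operatorname*{arg\,min}_{p \in \Delta_K}
  \left\{ \langle \hat{L}_{t-1}, p \rangle - \frac{1}{\eta_t} \sum_{i=1}^{K} \sqrt{p_i} \right\},
  \qquad
  \hat{\ell}_{t,i} = \frac{\ell_{t,i}\,\mathbf{1}[I_t = i]}{p_{t,i}},
  \qquad
  \hat{L}_{t,i} = \hat{L}_{t-1,i} + \hat{\ell}_{t,i},
\]

where $I_t \sim p_t$ is the arm drawn in round $t$, $\ell_t \in [0,1]^K$ is the (possibly adversarial) loss vector, and $\hat{L}_0 = 0$. Replacing or reweighting the regularizer over time is what gives the time-varying FTRL structure analyzed in the paper.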

Cite this Paper


BibTeX
@InProceedings{pmlr-v134-ito21a,
  title     = {Parameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret Bounds},
  author    = {Ito, Shinji},
  booktitle = {Proceedings of Thirty Fourth Conference on Learning Theory},
  pages     = {2552--2583},
  year      = {2021},
  editor    = {Belkin, Mikhail and Kpotufe, Samory},
  volume    = {134},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--19 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v134/ito21a/ito21a.pdf},
  url       = {https://proceedings.mlr.press/v134/ito21a.html}
}
Endnote
%0 Conference Paper
%T Parameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret Bounds
%A Shinji Ito
%B Proceedings of Thirty Fourth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2021
%E Mikhail Belkin
%E Samory Kpotufe
%F pmlr-v134-ito21a
%I PMLR
%P 2552--2583
%U https://proceedings.mlr.press/v134/ito21a.html
%V 134
APA
Ito, S. (2021). Parameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret Bounds. Proceedings of Thirty Fourth Conference on Learning Theory, in Proceedings of Machine Learning Research 134:2552-2583. Available from https://proceedings.mlr.press/v134/ito21a.html.