Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching

Quan Nguyen, Shinji Ito, Junpei Komiyama, Nishant Mehta
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:4386-4451, 2025.

Abstract

Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or have a sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds, and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
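As a rough illustration of the matching idea behind SPM (a schematic sketch in our own notation $\eta_t$, $z_t$, $h_t$; the paper's exact definitions may differ), a standard stability-penalty decomposition bounds the FTRL regret as

$$R_T \;\lesssim\; \sum_{t=1}^{T} \eta_t z_t \;+\; \sum_{t=1}^{T} \left(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\right) h_{t+1},$$

where $\eta_t$ is the learning rate, $z_t$ the per-round stability term, and $h_t$ the penalty (regularizer) term. Stability-penalty matching tunes $\eta_t$ so that each round's stability contribution matches its penalty increment, $\eta_t z_t \approx h_{t+1}\left(1/\eta_{t+1} - 1/\eta_t\right)$. In the constant case $z_t \equiv z$ and $h_t \equiv h$, this recursion gives $\eta_t \approx \sqrt{h/(2zt)}$ and hence $R_T = O(\sqrt{zhT})$; substituting data-dependent stability terms (e.g., reflecting sparsity or small losses) yields the corresponding data-dependent bounds.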

Cite this Paper


BibTeX
@InProceedings{pmlr-v291-nguyen25a,
  title = {Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching},
  author = {Nguyen, Quan and Ito, Shinji and Komiyama, Junpei and Mehta, Nishant},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages = {4386--4451},
  year = {2025},
  editor = {Haghtalab, Nika and Moitra, Ankur},
  volume = {291},
  series = {Proceedings of Machine Learning Research},
  month = {30 Jun--04 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/nguyen25a/nguyen25a.pdf},
  url = {https://proceedings.mlr.press/v291/nguyen25a.html},
  abstract = {Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or have a sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds, and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.}
}
Endnote
%0 Conference Paper
%T Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching
%A Quan Nguyen
%A Shinji Ito
%A Junpei Komiyama
%A Nishant Mehta
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-nguyen25a
%I PMLR
%P 4386--4451
%U https://proceedings.mlr.press/v291/nguyen25a.html
%V 291
%X Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or have a sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds, and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
APA
Nguyen, Q., Ito, S., Komiyama, J. & Mehta, N. (2025). Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching. Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:4386-4451. Available from https://proceedings.mlr.press/v291/nguyen25a.html.
