An Algorithm for Stochastic and Adversarial Bandits with Switching Costs

Chloé Rouyer, Yevgeny Seldin, Nicolò Cesa-Bianchi
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:9127-9135, 2021.

Abstract

We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the oblivious adversarial setting it achieves the minimax optimal regret bound of $ O( (\lambda K)^{1/3}T^{2/3} + \sqrt{KT})$, where $T$ is the time horizon and $K$ is the number of arms. In the stochastically constrained adversarial regime, which includes the stochastic regime as a special case, it achieves a regret bound of $O((\lambda K)^{2/3} T^{1/3} + \ln T)\sum_{i \neq i^*} \Delta_i^{-1})$, where $\Delta_i$ are suboptimality gaps and $i^*$ is the unique optimal arm. In the special case of $\lambda = 0$ (no switching costs), both bounds are minimax optimal within constants. We also explore variants of the problem, where switching cost is allowed to change over time. We provide experimental evaluation showing competitiveness of our algorithm with the relevant baselines in the stochastic, stochastically constrained adversarial, and adversarial regimes with fixed switching cost.

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-rouyer21a, title = {An Algorithm for Stochastic and Adversarial Bandits with Switching Costs}, author = {Rouyer, Chlo{\'e} and Seldin, Yevgeny and Cesa-Bianchi, Nicol{\`o}}, booktitle = {Proceedings of the 38th International Conference on Machine Learning}, pages = {9127--9135}, year = {2021}, editor = {Meila, Marina and Zhang, Tong}, volume = {139}, series = {Proceedings of Machine Learning Research}, month = {18--24 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v139/rouyer21a/rouyer21a.pdf}, url = {http://proceedings.mlr.press/v139/rouyer21a.html}, abstract = {We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the oblivious adversarial setting it achieves the minimax optimal regret bound of $ O( (\lambda K)^{1/3}T^{2/3} + \sqrt{KT})$, where $T$ is the time horizon and $K$ is the number of arms. In the stochastically constrained adversarial regime, which includes the stochastic regime as a special case, it achieves a regret bound of $O((\lambda K)^{2/3} T^{1/3} + \ln T)\sum_{i \neq i^*} \Delta_i^{-1})$, where $\Delta_i$ are suboptimality gaps and $i^*$ is the unique optimal arm. In the special case of $\lambda = 0$ (no switching costs), both bounds are minimax optimal within constants. We also explore variants of the problem, where switching cost is allowed to change over time. We provide experimental evaluation showing competitiveness of our algorithm with the relevant baselines in the stochastic, stochastically constrained adversarial, and adversarial regimes with fixed switching cost.} }
Endnote
%0 Conference Paper %T An Algorithm for Stochastic and Adversarial Bandits with Switching Costs %A Chloé Rouyer %A Yevgeny Seldin %A Nicolò Cesa-Bianchi %B Proceedings of the 38th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2021 %E Marina Meila %E Tong Zhang %F pmlr-v139-rouyer21a %I PMLR %P 9127--9135 %U http://proceedings.mlr.press/v139/rouyer21a.html %V 139 %X We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the oblivious adversarial setting it achieves the minimax optimal regret bound of $ O( (\lambda K)^{1/3}T^{2/3} + \sqrt{KT})$, where $T$ is the time horizon and $K$ is the number of arms. In the stochastically constrained adversarial regime, which includes the stochastic regime as a special case, it achieves a regret bound of $O((\lambda K)^{2/3} T^{1/3} + \ln T)\sum_{i \neq i^*} \Delta_i^{-1})$, where $\Delta_i$ are suboptimality gaps and $i^*$ is the unique optimal arm. In the special case of $\lambda = 0$ (no switching costs), both bounds are minimax optimal within constants. We also explore variants of the problem, where switching cost is allowed to change over time. We provide experimental evaluation showing competitiveness of our algorithm with the relevant baselines in the stochastic, stochastically constrained adversarial, and adversarial regimes with fixed switching cost.
APA
Rouyer, C., Seldin, Y. & Cesa-Bianchi, N.. (2021). An Algorithm for Stochastic and Adversarial Bandits with Switching Costs. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:9127-9135 Available from http://proceedings.mlr.press/v139/rouyer21a.html.

Related Material