Learning Periodic Strategies in Blocking Bandits Is as Hard as Bandits with Switching Costs

Nicolò Cesa-Bianchi; Junya Honda; Yuko Kuroki; Atsushi Miyauchi; Lukas Zierahn

Learning Periodic Strategies in Blocking Bandits Is as Hard as Bandits with Switching Costs

Nicolò Cesa-Bianchi, Junya Honda, Yuko Kuroki, Atsushi Miyauchi, Lukas Zierahn

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:1001-1021, 2026.

Abstract

In blocking $K$-armed bandits, playing an arm renders it unavailable for a fixed number of future rounds. While this model is relatively well understood in the stochastic regime, much less is known when rewards are generated adversarially. Via a novel reduction, we first show that computing the total reward of the best dynamic policy is NP-hard, even when the blocking time $d > 1$ is identical across arms. We therefore turn to tractable comparators and study the class of $d$-periodic policies, proving that the optimal periodic policy is efficiently computable and always obtains at least a $\frac{1}{K}$ fraction of the dynamic optimum. We also show that this $\frac{1}{K}$ factor is information-theoretically tight: no algorithm can achieve sublinear $\alpha$-regret with respect to the offline optimal dynamic policy for any $\alpha > \frac{1}{K}$. Our main result shows that $T^{2/3}$ is the minimax rate for the regret (against periodic policies) for adversarial blocking bandits with identical blocking times, and that this rate is achievable by an efficient algorithm. Our main technical contribution is the lower bound, which establishes that blocking bandits are at least as hard as bandits with switching costs. The matching upper bound instead follows from a reduction to combinatorial semi-bandits over bipartite matchings. Finally, we show that $\sqrt{T}$ regret rates are efficiently achievable in the full information setting, and more generally via $\alpha$-regret with $\alpha = \frac{1}{2}$.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-cesa-bianchi26a,
  title = 	 {Learning Periodic Strategies in Blocking Bandits Is as Hard as Bandits with Switching Costs},
  author =       {Cesa-Bianchi, Nicol\`{o} and Honda, Junya and Kuroki, Yuko and Miyauchi, Atsushi and Zierahn, Lukas},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {1001--1021},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/cesa-bianchi26a/cesa-bianchi26a.pdf},
  url = 	 {https://proceedings.mlr.press/v336/cesa-bianchi26a.html},
  abstract = 	 {In blocking $K$-armed bandits, playing an arm renders it unavailable for a fixed number of future rounds. While this model is relatively well understood in the stochastic regime, much less is known when rewards are generated adversarially. Via a novel reduction, we first show that computing the total reward of the best dynamic policy is NP-hard, even when the blocking time $d > 1$ is identical across arms. We therefore turn to tractable comparators and study the class of $d$-periodic policies, proving that the optimal periodic policy is efficiently computable and always obtains at least a $\frac{1}{K}$ fraction of the dynamic optimum. We also show that this $\frac{1}{K}$ factor is information-theoretically tight: no algorithm can achieve sublinear $\alpha$-regret with respect to the offline optimal dynamic policy for any $\alpha > \frac{1}{K}$. Our main result shows that $T^{2/3}$ is the minimax rate for the regret (against periodic policies) for adversarial blocking bandits with identical blocking times, and that this rate is achievable by an efficient algorithm. Our main technical contribution is the lower bound, which establishes that blocking bandits are at least as hard as bandits with switching costs. The matching upper bound instead follows from a reduction to combinatorial semi-bandits over bipartite matchings. Finally, we show that $\sqrt{T}$ regret rates are efficiently achievable in the full information setting, and more generally via $\alpha$-regret with $\alpha = \frac{1}{2}$.}
}

Endnote

%0 Conference Paper
%T Learning Periodic Strategies in Blocking Bandits Is as Hard as Bandits with Switching Costs
%A Nicolò Cesa-Bianchi
%A Junya Honda
%A Yuko Kuroki
%A Atsushi Miyauchi
%A Lukas Zierahn
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-cesa-bianchi26a
%I PMLR
%P 1001--1021
%U https://proceedings.mlr.press/v336/cesa-bianchi26a.html
%V 336
%X In blocking $K$-armed bandits, playing an arm renders it unavailable for a fixed number of future rounds. While this model is relatively well understood in the stochastic regime, much less is known when rewards are generated adversarially. Via a novel reduction, we first show that computing the total reward of the best dynamic policy is NP-hard, even when the blocking time $d > 1$ is identical across arms. We therefore turn to tractable comparators and study the class of $d$-periodic policies, proving that the optimal periodic policy is efficiently computable and always obtains at least a $\frac{1}{K}$ fraction of the dynamic optimum. We also show that this $\frac{1}{K}$ factor is information-theoretically tight: no algorithm can achieve sublinear $\alpha$-regret with respect to the offline optimal dynamic policy for any $\alpha > \frac{1}{K}$. Our main result shows that $T^{2/3}$ is the minimax rate for the regret (against periodic policies) for adversarial blocking bandits with identical blocking times, and that this rate is achievable by an efficient algorithm. Our main technical contribution is the lower bound, which establishes that blocking bandits are at least as hard as bandits with switching costs. The matching upper bound instead follows from a reduction to combinatorial semi-bandits over bipartite matchings. Finally, we show that $\sqrt{T}$ regret rates are efficiently achievable in the full information setting, and more generally via $\alpha$-regret with $\alpha = \frac{1}{2}$.

APA

Cesa-Bianchi, N., Honda, J., Kuroki, Y., Miyauchi, A. & Zierahn, L.. (2026). Learning Periodic Strategies in Blocking Bandits Is as Hard as Bandits with Switching Costs. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:1001-1021 Available from https://proceedings.mlr.press/v336/cesa-bianchi26a.html.

Related Material

Download PDF