[edit]
Learning Periodic Strategies in Blocking Bandits Is as Hard as Bandits with Switching Costs
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:1001-1021, 2026.
Abstract
In blocking $K$-armed bandits, playing an arm renders it unavailable for a fixed number of future rounds. While this model is relatively well understood in the stochastic regime, much less is known when rewards are generated adversarially. Via a novel reduction, we first show that computing the total reward of the best dynamic policy is NP-hard, even when the blocking time $d > 1$ is identical across arms. We therefore turn to tractable comparators and study the class of $d$-periodic policies, proving that the optimal periodic policy is efficiently computable and always obtains at least a $\frac{1}{K}$ fraction of the dynamic optimum. We also show that this $\frac{1}{K}$ factor is information-theoretically tight: no algorithm can achieve sublinear $\alpha$-regret with respect to the offline optimal dynamic policy for any $\alpha > \frac{1}{K}$. Our main result shows that $T^{2/3}$ is the minimax rate for the regret (against periodic policies) for adversarial blocking bandits with identical blocking times, and that this rate is achievable by an efficient algorithm. Our main technical contribution is the lower bound, which establishes that blocking bandits are at least as hard as bandits with switching costs. The matching upper bound instead follows from a reduction to combinatorial semi-bandits over bipartite matchings. Finally, we show that $\sqrt{T}$ regret rates are efficiently achievable in the full information setting, and more generally via $\alpha$-regret with $\alpha = \frac{1}{2}$.