Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:138-158, 2019.
We consider the variant of the stochastic multi-armed bandit problem where the reward distributions may change abruptly several times. In contrast to previous work, we are able to achieve (nearly) optimal minimax regret bounds without knowing the number of changes. For this setting, we propose an algorithm called ADSWITCH and provide performance guarantees for its regret evaluated against the optimal non-stationary policy. Our regret bound is the first optimal bound for an algorithm that is not tuned with respect to the number of changes.
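The piecewise-stationary setting described above can be illustrated with a minimal simulated environment: arm reward distributions stay fixed between change points and switch abruptly at them. This is only a sketch of the problem setting, not of the ADSWITCH algorithm; the class and parameter names below are our own illustrative choices, not from the paper.

```python
import random

class SwitchingBandit:
    """Piecewise-stationary Bernoulli bandit (illustrative sketch).

    Arm means stay constant within a segment and change abruptly at
    the given change points, as in the abruptly-changing setting."""

    def __init__(self, mean_segments, change_points):
        # mean_segments[i]: per-arm Bernoulli means used in segment i.
        # change_points: rounds at which the means switch to the next segment.
        self.mean_segments = mean_segments
        self.change_points = change_points
        self.t = 0  # current round

    def current_means(self):
        # The active segment is determined by how many change points have passed.
        segment = sum(1 for c in self.change_points if self.t >= c)
        return self.mean_segments[segment]

    def pull(self, arm):
        # Draw a Bernoulli reward from the currently active distribution.
        means = self.current_means()
        self.t += 1
        return 1 if random.random() < means[arm] else 0

# Usage: two arms whose best-arm identity flips at round 50.
env = SwitchingBandit(mean_segments=[[0.9, 0.1], [0.1, 0.9]],
                      change_points=[50])
```

Against such an environment, the regret benchmark in the abstract is the optimal non-stationary policy, i.e. one that plays the best arm of each segment.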