Tracking Most Significant Arm Switches in Bandits
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:2160-2182, 2022.
Abstract
In \emph{bandits with distribution shifts}, one aims to automatically adapt to unknown changes in reward distribution, and \emph{restart} exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually \emph{severe}. To resolve this question, we propose a new notion of \emph{significant shift}, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S \ll L$ only counts best arm switches, while at the same time always faster than the optimal $O(V^{1/3}T^{2/3})$ when expressed in terms of \emph{total variation} $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
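For orientation, the rate comparisons claimed above can be summarized in the following display (a rough sketch using the abstract's symbols; constants and logarithmic factors are suppressed, and the precise guarantees are the ones stated in the paper's theorems):

% Symbols as in the abstract: T rounds, L distribution changes,
% S <= L best arm switches, V total variation; log factors omitted.
\[
  \mathrm{Regret} \;\lesssim\; \min\Bigl\{\sqrt{S\,T},\; V^{1/3}\,T^{2/3}\Bigr\} \;\le\; \sqrt{L\,T},
\]
% The last inequality follows from S <= L, so sqrt(ST) <= sqrt(LT);
% the gap is large whenever S << L, i.e., when few of the L changes
% actually switch the best arm.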