Smooth Non-stationary Bandits
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:14930-14944, 2023.
Abstract
In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, for which they guarantee T^{2/3} regret. However, in practice environments often change smoothly, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. In this paper, we study a non-stationary two-armed bandit problem where we assume an arm’s mean reward is a β-Hölder function of (normalized) time, meaning it is (β−1)-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with T^{3/5} regret for β = 2. We complement this result with a T^{(β+1)/(2β+1)} lower bound for any integer β ≥ 1, which matches our upper bound for β = 2.
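For reference, a minimal formal sketch of the smoothness assumption described above, using assumed notation (μ_k for the mean reward of arm k ∈ {1, 2} as a function of normalized time in [0,1], and L for a Hölder constant; these labels are illustrative and not necessarily the paper’s): for an integer β ≥ 1, saying μ_k is β-Hölder means it is (β−1)-times differentiable with a Lipschitz (β−1)-st derivative,
\[
\bigl|\mu_k^{(\beta-1)}(s) - \mu_k^{(\beta-1)}(t)\bigr| \;\le\; L\,|s - t|
\qquad \text{for all } s, t \in [0,1],\ k \in \{1,2\}.
\]
In particular, β = 1 recovers the standard Lipschitz (non-smooth) setting with rate T^{2/3}, while larger β imposes more smoothness and yields the smaller exponent (β+1)/(2β+1) appearing in the lower bound.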