A Simple Approach for Non-stationary Linear Bandits
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:746-755, 2020.
Abstract
This paper investigates the problem of non-stationary linear bandits, where the unknown regression parameter evolves over time. Previous studies have adopted sophisticated mechanisms, such as a sliding window or weighted penalty, to achieve near-optimal dynamic regret. In this paper, we demonstrate that a simple restarted strategy is sufficient to attain the same regret guarantee. Specifically, we design a UCB-type algorithm to balance exploitation and exploration, and restart it periodically to handle the drift of the unknown parameter. Let $T$ be the time horizon, $d$ the dimension, and $P_T$ the path-length measuring the fluctuation of the evolving unknown parameter; our approach then enjoys an $\tilde{O}(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ dynamic regret, which is nearly optimal, matching the $\Omega(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ minimax lower bound up to logarithmic factors. Empirical studies also validate the efficacy of our approach.
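To make the restarting idea concrete, below is a minimal sketch of a restarted LinUCB-style loop, assuming a standard stochastic linear bandit interface. The helpers `arms_fn` and `reward_fn`, the epoch length `H`, the confidence radius `beta`, and the ridge parameter `lam` are all hypothetical placeholders; the paper's actual algorithm and its tuning of the restart period may differ.

```python
import numpy as np

def restarted_linucb(arms_fn, reward_fn, T, d, H, lam=1.0, beta=1.0):
    """LinUCB that discards its statistics every H rounds (hypothetical sketch).

    arms_fn(t)      -> (K, d) array of feasible arm feature vectors at round t
    reward_fn(t, x) -> observed reward for pulling the arm with features x
    """
    rewards = []
    for t in range(T):
        if t % H == 0:              # periodic restart: forget stale data
            V = lam * np.eye(d)     # regularized Gram matrix
            b = np.zeros(d)         # accumulated reward-weighted features
        theta_hat = np.linalg.solve(V, b)  # ridge-regression estimate
        V_inv = np.linalg.inv(V)
        arms = arms_fn(t)
        # Optimistic index: estimated reward plus an exploration bonus
        # proportional to the arm's Mahalanobis norm under V^{-1}.
        ucb = arms @ theta_hat + beta * np.sqrt(
            np.einsum('kd,dj,kj->k', arms, V_inv, arms))
        x = arms[np.argmax(ucb)]
        r = reward_fn(t, x)
        V += np.outer(x, x)         # update statistics within the epoch
        b += r * x
        rewards.append(r)
    return np.array(rewards)
```

The restart period trades off statistical accuracy within an epoch against the parameter drift accumulated during it; choosing `H` as a function of $d$, $T$, and (an estimate of) $P_T$ is what yields the $\tilde{O}(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ rate stated above.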