A Simple Approach for Nonstationary Linear Bandits
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:746-755, 2020.
Abstract
This paper investigates the problem of non-stationary linear bandits, where the unknown regression parameter evolves over time. Previous studies have adopted sophisticated mechanisms, such as sliding windows or weighted penalties, to achieve near-optimal dynamic regret. In this paper, we demonstrate that a simple restarted strategy is sufficient to attain the same regret guarantee. Specifically, we design a UCB-type algorithm to balance exploitation and exploration, and restart it periodically to handle the drift of the unknown parameter. Let $T$ be the time horizon, $d$ the dimension, and $P_T$ the path-length that measures the fluctuation of the evolving unknown parameter; our approach enjoys an $\tilde{O}(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ dynamic regret, which is nearly optimal, matching the $\Omega(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ minimax lower bound up to logarithmic factors. Empirical studies also validate the efficacy of our approach.
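The restarted strategy described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the names (`restarted_linucb`, `get_arms`, `get_reward`), the confidence radius `beta`, and the restart period `H` are assumptions for the sketch; the paper's analysis ties the restart period to $T$, $d$, and $P_T$ to obtain the stated regret bound.

```python
import numpy as np

def restarted_linucb(get_arms, get_reward, T, d, H, lam=1.0, beta=1.0):
    """UCB-type linear bandit learner, restarted every H rounds (a sketch).

    get_arms(t)      -> array of shape (K, d): the arm set at round t
    get_reward(t, x) -> float: noisy reward for pulling arm x at round t
    H                -> restart period (hypothetical tuning parameter)
    """
    V = lam * np.eye(d)   # regularized design matrix
    b = np.zeros(d)       # sum of reward-weighted contexts
    chosen = []
    for t in range(T):
        if t > 0 and t % H == 0:
            # Periodic restart: discard past data to track parameter drift.
            V = lam * np.eye(d)
            b = np.zeros(d)
        theta_hat = np.linalg.solve(V, b)   # ridge regression estimate
        arms = get_arms(t)
        V_inv = np.linalg.inv(V)
        # Optimistic index: estimated reward plus an exploration bonus
        # proportional to the Mahalanobis norm ||x||_{V^{-1}}.
        bonus = np.sqrt(np.einsum('kd,de,ke->k', arms, V_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + beta * bonus))]
        r = get_reward(t, x)
        V += np.outer(x, x)
        b += r * x
        chosen.append(x)
    return np.array(chosen)

# Toy usage: a slowly drifting parameter (illustrative only).
rng = np.random.default_rng(0)
d, K, T, H = 3, 5, 200, 50
arm_sets = rng.normal(size=(T, K, d))
theta_star = lambda t: np.array([np.cos(t / 100.0), np.sin(t / 100.0), 0.5])
picks = restarted_linucb(
    get_arms=lambda t: arm_sets[t],
    get_reward=lambda t, x: float(x @ theta_star(t) + 0.1 * rng.normal()),
    T=T, d=d, H=H,
)
```

Between restarts the learner behaves like a standard optimistic (LinUCB-style) algorithm; the only non-stationarity mechanism is forgetting everything every `H` rounds, which is the simplicity the paper argues for.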