Bayesian Change-Point Detection for Bandit Feedback in Non-stationary Environments
Proceedings of The 14th Asian Conference on Machine
Learning, PMLR 189:17-31, 2023.
Abstract
The stochastic multi-armed bandit problem has been
widely studied under the stationary
assumption. However, in real-world problems and
industrial applications, this assumption is often
unrealistic because the reward distributions may
change over time. In this paper, we consider the
piecewise i.i.d. non-stationary stochastic multi-armed
bandit problem with unknown change-points, focusing
on changes in the mean rewards. To solve it, we
propose a change-point-based framework in which we
study a class of change-detection-based optimal
bandit policies that actively detect change-points
using a restarted Bayesian online change-point
detector and then restart the bandit
indices. Analytically, in the context of regret
minimization, our proposal achieves an
$\mathcal{O}(\sqrt{A T K_T})$ regret upper bound
where $K_T$ is the overall number of change-points
up to the horizon $T$ and $A$ is the number of
arms. The derived bound matches the existing lower
bound for abruptly changing environments. Finally,
we demonstrate the cumulative regret reduction of
our proposal on synthetic Bernoulli rewards as
well as Yahoo! datasets of webpage click-through
rates.
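The abstract does not specify the detector beyond naming it a restarted Bayesian online change-point detector, so the following is only an illustrative sketch of the general idea: a simplified Bayesian online change-point detector for Bernoulli rewards, of the kind that could drive such restarts. The constant hazard rate, the detection threshold, and the warm-up length are all assumptions made here for illustration.

```python
class BernoulliBOCD:
    """Simplified Bayesian online change-point detector for Bernoulli
    rewards (an illustration of the general idea, not the paper's
    exact restarted detector). It maintains the posterior over the
    current run length, with a Beta(1, 1) prior per run and a
    constant hazard (per-step change probability)."""

    def __init__(self, hazard=0.01):
        self.hazard = hazard   # assumed prior change probability per step
        self.probs = [1.0]     # probs[r] = P(current run length = r)
        self.alphas = [1.0]    # Beta posterior parameters, one pair per run
        self.betas = [1.0]

    def update(self, x):
        """Observe a reward x in {0, 1}; return True if a change is flagged."""
        # Predictive probability of x under each run's Beta posterior.
        preds = [(a if x else b) / (a + b)
                 for a, b in zip(self.alphas, self.betas)]
        # Run-length posterior recursion: each run either grows by one
        # step or a change occurs and a new run starts at length 0.
        growth = [p * q * (1.0 - self.hazard)
                  for p, q in zip(self.probs, preds)]
        cp = sum(p * q for p, q in zip(self.probs, preds)) * self.hazard
        probs = [cp] + growth
        z = sum(probs)
        self.probs = [p / z for p in probs]
        # Update sufficient statistics; a new run restarts from the prior.
        self.alphas = [1.0] + [a + x for a in self.alphas]
        self.betas = [1.0] + [b + (1 - x) for b in self.betas]
        # Flag a change once most of the posterior mass sits on very
        # short run lengths (threshold 0.5 and a warm-up of 10 steps
        # are arbitrary choices for this sketch).
        return len(self.probs) > 10 and sum(self.probs[:3]) > 0.5
```

In a change-detection-based policy of the kind the paper studies, one such detector would monitor each arm's observed rewards, and a detection would trigger a restart of the bandit indices (e.g. resetting the empirical means and pull counts of a UCB-style policy), typically combined with some forced exploration so that changes on rarely pulled arms can still be detected.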