Thompson Sampling in Switching Environments with Bayesian Online Change Detection


Joseph Mellor, Jonathan Shapiro ;
Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, PMLR 31:442-450, 2013.


Thompson Sampling has recently been shown to achieve the lower bound on regret in the Bernoulli Multi-Armed Bandit setting. This bandit problem assumes stationary distributions for the rewards. It is often unrealistic to model the real world as a stationary distribution. In this paper we derive and evaluate algorithms using Thompson Sampling for a Switching Multi-Armed Bandit Problem. We propose a Thompson Sampling strategy equipped with a Bayesian change point mechanism to tackle this problem. We develop algorithms for a variety of cases with constant switching rate: when switching occurs all arms change (Global Switching), switching occurs independently for each arm (Per-Arm Switching), when the switching rate is known and when it must be inferred from data. This leads to a family of algorithms we collectively term Change-Point Thompson Sampling (CTS). We show empirical results in 4 artificial environments, and 2 derived from real world data: news click-through and foreign exchange data, comparing them to some other bandit algorithms. In real world data CTS is the most effective.

Related Material