A Thompson Sampling Algorithm for Cascading Bandits
[edit]
Proceedings of Machine Learning Research, PMLR 89:438447, 2019.
Abstract
We design and analyze TSCascade, a Thompson sampling algorithm for the cascading bandit problem. In TSCascade, Bayesian estimates of the click probability are constructed using a univariate Gaussian; this leads to a more efficient exploration procedure visãvis existing UCBbased approaches. We also incorporate the empirical variance of each item’s click probability into the Bayesian updates. These two novel features allow us to prove an expected regret bound of the form $\tilde{O}(\sqrt{KLT})$ where $L$ and $K$ are the number of ground items and the number of items in the chosen list respectively and $T\ge L$ is the number of Thompson sampling update steps. This matches the stateoftheart regret bounds for UCBbased algorithms. More importantly, it is the first theoretical guarantee on a Thompson sampling algorithm for any stochastic combinatorial bandit problem model with partial feedback. Empirical experiments demonstrate superiority of TSCascade compared to existing UCBbased procedures in terms of the expected cumulative regret and the time complexity.
Related Material


