Thompson Sampling for Combinatorial SemiBandits
[edit]
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:51145122, 2018.
Abstract
We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multiarmed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB, and obtain the first distributiondependent regret bound of $O(m\log T / \Delta_{\min}) $ for TS under general CMAB, where $m$ is the number of arms, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and any nonoptimal solution. We also show that one cannot use an approximate oracle in TS algorithm for even MAB problems. Then we expand the analysis to matroid bandit, a special case of CMAB and for which we could remove the independence assumption across arms and achieve a better regret bound. Finally, we use some experiments to show the comparison of regrets of CUCB and CTS algorithms.
Related Material


