Thompson Sampling for Combinatorial Semi-Bandits

Siwei Wang; Wei Chen

Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5114-5122, 2018.

Abstract

We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB and obtain the first distribution-dependent regret bound of $O(m\log T / \Delta_{\min})$ for TS under general CMAB, where $m$ is the number of arms, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that one cannot use an approximate oracle in the TS algorithm, even for MAB problems. We then extend the analysis to the matroid bandit, a special case of CMAB for which we can remove the independence assumption across arms and achieve a better regret bound. Finally, we present experiments comparing the regrets of the CUCB and CTS algorithms.
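To make the setting in the abstract concrete, here is a minimal sketch of Thompson sampling with semi-bandit feedback. It is an illustration under stated assumptions, not the paper's exact algorithm: outcomes are assumed to be Bernoulli, each arm has an independent Beta(1, 1) prior, and the feasible solutions are assumed to be all sets of exactly k arms, so the exact oracle is a simple top-k selection. The names `top_k_oracle` and `cts_top_k` are hypothetical.

```python
import random


def top_k_oracle(theta, k):
    """Exact oracle for the assumed top-k problem: indices of the k largest values."""
    return sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]


def cts_top_k(means, k, horizon, seed=0):
    """Run Thompson sampling with semi-bandit feedback on Bernoulli arms.

    means   -- true mean of each base arm (hidden from the learner, used to simulate)
    k       -- number of base arms played per round (assumed constraint)
    horizon -- number of rounds T
    Returns the cumulative expected regret and the final Beta parameters.
    """
    rng = random.Random(seed)
    m = len(means)
    a = [1] * m  # Beta prior "success" counts, one per base arm
    b = [1] * m  # Beta prior "failure" counts, one per base arm
    best = sum(sorted(means, reverse=True)[:k])  # expected reward of the optimal set
    regret = 0.0

    for _ in range(horizon):
        # Draw one posterior sample per base arm.
        theta = [rng.betavariate(a[i], b[i]) for i in range(m)]
        # Ask the exact oracle for the best set under the sampled means.
        chosen = top_k_oracle(theta, k)
        regret += best - sum(means[i] for i in chosen)
        # Semi-bandit feedback: observe and update every arm that was played.
        for i in chosen:
            outcome = 1 if rng.random() < means[i] else 0
            a[i] += outcome
            b[i] += 1 - outcome

    return regret, a, b


if __name__ == "__main__":
    total_regret, a, b = cts_top_k([0.9, 0.8, 0.5, 0.3, 0.2], k=2, horizon=2000)
    print(f"cumulative expected regret: {total_regret:.2f}")
```

The oracle here is exact by construction; the abstract's negative result concerns replacing it with an approximate one, which this sketch does not attempt.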

Cite this Paper

BibTeX


@InProceedings{pmlr-v80-wang18a,
  title = 	 {Thompson Sampling for Combinatorial Semi-Bandits},
  author =       {Wang, Siwei and Chen, Wei},
  booktitle = 	 {Proceedings of the 35th International Conference on Machine Learning},
  pages = 	 {5114--5122},
  year = 	 {2018},
  editor = 	 {Dy, Jennifer and Krause, Andreas},
  volume = 	 {80},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v80/wang18a/wang18a.pdf},
  url = 	 {https://proceedings.mlr.press/v80/wang18a.html},
  abstract = 	 {We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB and obtain the first distribution-dependent regret bound of $O(m\log T / \Delta_{\min})$ for TS under general CMAB, where $m$ is the number of arms, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that one cannot use an approximate oracle in the TS algorithm, even for MAB problems. We then extend the analysis to the matroid bandit, a special case of CMAB for which we can remove the independence assumption across arms and achieve a better regret bound. Finally, we present experiments comparing the regrets of the CUCB and CTS algorithms.}
}

Endnote

%0 Conference Paper
%T Thompson Sampling for Combinatorial Semi-Bandits
%A Siwei Wang
%A Wei Chen
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-wang18a
%I PMLR
%P 5114--5122
%U https://proceedings.mlr.press/v80/wang18a.html
%V 80
%X We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB and obtain the first distribution-dependent regret bound of $O(m\log T / \Delta_{\min})$ for TS under general CMAB, where $m$ is the number of arms, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that one cannot use an approximate oracle in the TS algorithm, even for MAB problems. We then extend the analysis to the matroid bandit, a special case of CMAB for which we can remove the independence assumption across arms and achieve a better regret bound. Finally, we present experiments comparing the regrets of the CUCB and CTS algorithms.

APA


Wang, S. & Chen, W. (2018). Thompson Sampling for Combinatorial Semi-Bandits. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5114-5122. Available from https://proceedings.mlr.press/v80/wang18a.html.
