Avoiding exp($k^*$) Scaling for Thompson Sampling in Combinatorial Semi-Bandits: From Multiple Seeds to a Single Seed

Tianyuan Jin, Heyang Zhao, Vincent Y. F. Tan, Quanquan Gu
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:3825-3855, 2026.

Abstract

The Combinatorial Multi-Armed Bandit (CMAB) framework extends classical multi-armed bandit theory to complex decision-making settings where agents select super arms to maximize a collective reward. While Thompson Sampling (TS) is widely favored for its robust empirical performance in these settings, its theoretical guarantees have historically suffered from a significant bottleneck: standard Combinatorial Thompson Sampling (\texttt{CTS}) incurs a regret bound with an exponential dependence on the size $k^*$ of the optimal super-arm. This exponential term arises because standard independent posterior sampling fails to coordinate optimism across the base arms of the optimal super arm, causing the probability of exploration to vanish as $k^*$ increases. Although recent advances have achieved polynomial regret for \emph{linear} rewards, designing an efficient TS algorithm for general, non-linear CMABs remains an open challenge. In this paper, we resolve this open question by proposing \emph{Combinatorial Thompson Sampling with a Single Seed} (\texttt{CTS$^3$}). Unlike standard approaches that sample base arms independently, \texttt{CTS$^3$} employs a comonotonic coupling strategy: it generates parameters for all base arms using a single shared random seed via the inverse CDF transform. This mechanism synchronizes sampling fluctuations across arms, ensuring concerted optimism and preventing the exploration probability from decaying exponentially. We prove that \texttt{CTS$^3$} achieves a regret bound of ${O}\left( \frac{m kk^*B^2}{\Delta_{\min}}\poly(\log(T,m,\Delta_{\max}/\Delta_{\min}))\right)$ for general reward functions satisfying monotonicity and bounded smoothness, where $m$ is the number of total base arms, $k$ is the largest super arm size, and $k^*$ is size of the optimal arm. To the best of our knowledge, this is the first polynomial regret bound for Thompson Sampling in general CMAB settings. Empirical evaluations confirm that \texttt{CTS$^3$} significantly outperforms standard independent TS, particularly in regimes with large super arms.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-jin26a, title = {Avoiding exp($k^*$) Scaling for Thompson Sampling in Combinatorial Semi-Bandits: From Multiple Seeds to a Single Seed}, author = {Jin, Tianyuan and Zhao, Heyang and Tan, Vincent Y. F. and Gu, Quanquan}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {3825--3855}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/jin26a/jin26a.pdf}, url = {https://proceedings.mlr.press/v336/jin26a.html}, abstract = {The Combinatorial Multi-Armed Bandit (CMAB) framework extends classical multi-armed bandit theory to complex decision-making settings where agents select super arms to maximize a collective reward. While Thompson Sampling (TS) is widely favored for its robust empirical performance in these settings, its theoretical guarantees have historically suffered from a significant bottleneck: standard Combinatorial Thompson Sampling (\texttt{CTS}) incurs a regret bound with an exponential dependence on the size $k^*$ of the optimal super-arm. This exponential term arises because standard independent posterior sampling fails to coordinate optimism across the base arms of the optimal super arm, causing the probability of exploration to vanish as $k^*$ increases. Although recent advances have achieved polynomial regret for \emph{linear} rewards, designing an efficient TS algorithm for general, non-linear CMABs remains an open challenge. In this paper, we resolve this open question by proposing \emph{Combinatorial Thompson Sampling with a Single Seed} (\texttt{CTS$^3$}). Unlike standard approaches that sample base arms independently, \texttt{CTS$^3$} employs a comonotonic coupling strategy: it generates parameters for all base arms using a single shared random seed via the inverse CDF transform. This mechanism synchronizes sampling fluctuations across arms, ensuring concerted optimism and preventing the exploration probability from decaying exponentially. We prove that \texttt{CTS$^3$} achieves a regret bound of ${O}\left( \frac{m kk^*B^2}{\Delta_{\min}}\poly(\log(T,m,\Delta_{\max}/\Delta_{\min}))\right)$ for general reward functions satisfying monotonicity and bounded smoothness, where $m$ is the number of total base arms, $k$ is the largest super arm size, and $k^*$ is size of the optimal arm. To the best of our knowledge, this is the first polynomial regret bound for Thompson Sampling in general CMAB settings. Empirical evaluations confirm that \texttt{CTS$^3$} significantly outperforms standard independent TS, particularly in regimes with large super arms.} }
Endnote
%0 Conference Paper %T Avoiding exp($k^*$) Scaling for Thompson Sampling in Combinatorial Semi-Bandits: From Multiple Seeds to a Single Seed %A Tianyuan Jin %A Heyang Zhao %A Vincent Y. F. Tan %A Quanquan Gu %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-jin26a %I PMLR %P 3825--3855 %U https://proceedings.mlr.press/v336/jin26a.html %V 336 %X The Combinatorial Multi-Armed Bandit (CMAB) framework extends classical multi-armed bandit theory to complex decision-making settings where agents select super arms to maximize a collective reward. While Thompson Sampling (TS) is widely favored for its robust empirical performance in these settings, its theoretical guarantees have historically suffered from a significant bottleneck: standard Combinatorial Thompson Sampling (\texttt{CTS}) incurs a regret bound with an exponential dependence on the size $k^*$ of the optimal super-arm. This exponential term arises because standard independent posterior sampling fails to coordinate optimism across the base arms of the optimal super arm, causing the probability of exploration to vanish as $k^*$ increases. Although recent advances have achieved polynomial regret for \emph{linear} rewards, designing an efficient TS algorithm for general, non-linear CMABs remains an open challenge. In this paper, we resolve this open question by proposing \emph{Combinatorial Thompson Sampling with a Single Seed} (\texttt{CTS$^3$}). Unlike standard approaches that sample base arms independently, \texttt{CTS$^3$} employs a comonotonic coupling strategy: it generates parameters for all base arms using a single shared random seed via the inverse CDF transform. This mechanism synchronizes sampling fluctuations across arms, ensuring concerted optimism and preventing the exploration probability from decaying exponentially. We prove that \texttt{CTS$^3$} achieves a regret bound of ${O}\left( \frac{m kk^*B^2}{\Delta_{\min}}\poly(\log(T,m,\Delta_{\max}/\Delta_{\min}))\right)$ for general reward functions satisfying monotonicity and bounded smoothness, where $m$ is the number of total base arms, $k$ is the largest super arm size, and $k^*$ is size of the optimal arm. To the best of our knowledge, this is the first polynomial regret bound for Thompson Sampling in general CMAB settings. Empirical evaluations confirm that \texttt{CTS$^3$} significantly outperforms standard independent TS, particularly in regimes with large super arms.
APA
Jin, T., Zhao, H., Tan, V.Y.F. & Gu, Q.. (2026). Avoiding exp($k^*$) Scaling for Thompson Sampling in Combinatorial Semi-Bandits: From Multiple Seeds to a Single Seed. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:3825-3855 Available from https://proceedings.mlr.press/v336/jin26a.html.

Related Material