Optimism Stabilizes Thompson Sampling for Adaptive Inference

Shunxing Yan; Han Zhong

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:6885-6886, 2026.

Abstract

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential behavior under adaptive data collection is subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random, history-dependent, and coupled with the observed rewards through the action-selection rule. A useful sufficient condition for valid asymptotic inference is stability, which requires each arm’s pull count to concentrate around a deterministic scale. While stability is now understood for several UCB-type algorithms, vanilla TS can be unstable, leading to nonstandard asymptotics and potentially invalid Wald-type confidence intervals. We identify optimism as a general mechanism for stabilizing Thompson sampling. In the $K$-armed Gaussian bandit with any fixed $K\ge2$, we study two optimistic TS variants. The first is TS with posterior variance inflation; the second keeps the posterior variance unchanged but adds an explicit optimism bonus to the posterior mean. For both variants, we prove stability: optimal arms asymptotically share the horizon uniformly, while each suboptimal arm is sampled on a sharp gap-dependent logarithmic scale. For variance-inflated TS, this resolves the open problem posed by Halder et al. (2025) by extending their two-armed stability theory to general $K$-armed bandits, including instances with multiple optimal arms. For the mean-bonus variant, our result shows that stability can also be achieved through a direct optimistic shift of the posterior center, without inflating the posterior variance. The main technical novelty lies in the treatment of variance-inflated TS with multiple optimal arms. In this regime, stability requires proving that the randomized competition among statistically indistinguishable optimal arms converges to a deterministic allocation.
We isolate a limiting pure-noise competition and prove a negative-feedback property: over-sampled optimal arms become less likely to win future posterior draws, while under-sampled ones become more likely to be selected. This yields a contraction toward the uniform allocation over the optimal set. Concentration and rare-event estimates then control the perturbations caused by empirical-mean errors and occasional suboptimal selections. For the mean-bonus variant, we use a separate argument based on posterior-sampling concentration and UCB-type comparisons, since optimism enters through a deterministic shift of the posterior mean rather than through variance inflation. These stability results imply asymptotically valid adaptive inference. In particular, for either optimistic TS variant, the usual studentized sample mean is asymptotically standard normal, and standard Wald confidence intervals achieve the nominal coverage probability despite adaptive sampling. Thus, suitably implemented optimism stabilizes Thompson sampling and enables classical inference from adaptively collected bandit data, while incurring only a mild additional regret cost.
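
As a rough illustration of the two optimistic variants described in the abstract, the Python sketch below simulates a unit-variance Gaussian bandit and then forms studentized Wald intervals from the adaptively collected rewards. It is not the authors' implementation. The flat-prior posterior N(sample mean, 1/n), the inflation factor `inflation`, the bonus form `bonus_scale * sqrt(2 log t / n)`, the two initial pulls per arm, and the function names `run_optimistic_ts` and `wald_intervals` are all assumptions made here for illustration; the paper should be consulted for the actual tuning under which stability is proved.

```python
import numpy as np


def run_optimistic_ts(means, horizon, variant="inflate",
                      inflation=4.0, bonus_scale=1.0, seed=0):
    """Simulate a K-armed unit-variance Gaussian bandit (illustrative only).

    variant="inflate": posterior draw has its variance multiplied by `inflation`.
    variant="bonus":   posterior variance unchanged, mean shifted upward.
    Both tuning choices are hypothetical, not taken from the paper.
    Assumes horizon >= 2 * K so every arm is pulled at least twice.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    counts = np.zeros(k)
    sums = np.zeros(k)
    sq_sums = np.zeros(k)
    for t in range(horizon):
        if t < 2 * k:
            # Initialization: pull each arm twice so variances are estimable.
            arm = t % k
        else:
            mu_hat = sums / counts
            post_sd = 1.0 / np.sqrt(counts)  # flat-prior posterior std (assumed)
            if variant == "inflate":
                draws = rng.normal(mu_hat, np.sqrt(inflation) * post_sd)
            elif variant == "bonus":
                shift = bonus_scale * np.sqrt(2.0 * np.log(t) / counts)
                draws = rng.normal(mu_hat + shift, post_sd)
            else:
                raise ValueError("variant must be 'inflate' or 'bonus'")
            arm = int(np.argmax(draws))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward ** 2
    return counts, sums, sq_sums


def wald_intervals(counts, sums, sq_sums, z=1.96):
    """Per-arm Wald intervals: sample mean +/- z * sample_sd / sqrt(n)."""
    mu_hat = sums / counts
    var_hat = np.maximum(sq_sums - counts * mu_hat ** 2, 0.0) / (counts - 1.0)
    half = z * np.sqrt(var_hat / counts)
    return mu_hat - half, mu_hat + half


if __name__ == "__main__":
    # Two optimal arms and one suboptimal arm (hypothetical instance).
    counts, sums, sq_sums = run_optimistic_ts([0.5, 0.5, 0.0], horizon=2000)
    lower, upper = wald_intervals(counts, sums, sq_sums)
    print("pull counts:", counts)
    print("intervals:", list(zip(lower, upper)))
```

Repeating the simulation over many seeds and recording how often each interval covers the true mean gives an empirical check on coverage; under the stability results stated in the abstract, that frequency should approach the nominal level as the horizon grows, provided the tuning matches the paper's conditions.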

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-yan26a,
  title = 	 {Optimism Stabilizes Thompson Sampling for Adaptive Inference},
  author =       {Yan, Shunxing and Zhong, Han},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {6885--6886},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/yan26a/yan26a.pdf},
  url = 	 {https://proceedings.mlr.press/v336/yan26a.html},
  abstract = 	 {Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential behavior under adaptive data collection is subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random, history-dependent, and coupled with the observed rewards through the action-selection rule. A useful sufficient condition for valid asymptotic inference is \emph{stability}, which requires each arm’s pull count to concentrate around a deterministic scale. While stability is now understood for several UCB-type algorithms, vanilla TS can be unstable, leading to nonstandard asymptotics and potentially invalid Wald-type confidence intervals. We identify optimism as a general mechanism for stabilizing Thompson sampling. In the $K$-armed Gaussian bandit with any fixed $K\ge2$, we study two optimistic TS variants. The first is TS with posterior variance inflation; the second keeps the posterior variance unchanged but adds an explicit optimism bonus to the posterior mean. For both variants, we prove stability: optimal arms asymptotically share the horizon uniformly, while each suboptimal arm is sampled on a sharp gap-dependent logarithmic scale. For variance-inflated TS, this resolves the open problem posed by Halder et al. (2025) by extending their two-armed stability theory to general $K$-armed bandits, including instances with multiple optimal arms. For the mean-bonus variant, our result shows that stability can also be achieved through a direct optimistic shift of the posterior center, without inflating the posterior variance. The main technical novelty lies in the treatment of variance-inflated TS with multiple optimal arms. In this regime, stability requires proving that the randomized competition among statistically indistinguishable optimal arms converges to a deterministic allocation. 
We isolate a limiting pure-noise competition and prove a negative-feedback property: over-sampled optimal arms become less likely to win future posterior draws, while under-sampled ones become more likely to be selected. This yields a contraction toward the uniform allocation over the optimal set. Concentration and rare-event estimates then control the perturbations caused by empirical-mean errors and occasional suboptimal selections. For the mean-bonus variant, we use a separate argument based on posterior-sampling concentration and UCB-type comparisons, since optimism enters through a deterministic shift of the posterior mean rather than through variance inflation. These stability results imply asymptotically valid adaptive inference. In particular, for either optimistic TS variant, the usual studentized sample mean is asymptotically standard normal, and standard Wald confidence intervals achieve the nominal coverage probability despite adaptive sampling. Thus, suitably implemented optimism stabilizes Thompson sampling and enables classical inference from adaptively collected bandit data, while incurring only a mild additional regret cost.}
}

Endnote

%0 Conference Paper
%T Optimism Stabilizes Thompson Sampling for Adaptive Inference
%A Shunxing Yan
%A Han Zhong
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore
%F pmlr-v336-yan26a
%I PMLR
%P 6885--6886
%U https://proceedings.mlr.press/v336/yan26a.html
%V 336
%X Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential behavior under adaptive data collection is subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random, history-dependent, and coupled with the observed rewards through the action-selection rule. A useful sufficient condition for valid asymptotic inference is \emph{stability}, which requires each arm’s pull count to concentrate around a deterministic scale. While stability is now understood for several UCB-type algorithms, vanilla TS can be unstable, leading to nonstandard asymptotics and potentially invalid Wald-type confidence intervals. We identify optimism as a general mechanism for stabilizing Thompson sampling. In the $K$-armed Gaussian bandit with any fixed $K\ge2$, we study two optimistic TS variants. The first is TS with posterior variance inflation; the second keeps the posterior variance unchanged but adds an explicit optimism bonus to the posterior mean. For both variants, we prove stability: optimal arms asymptotically share the horizon uniformly, while each suboptimal arm is sampled on a sharp gap-dependent logarithmic scale. For variance-inflated TS, this resolves the open problem posed by Halder et al. (2025) by extending their two-armed stability theory to general $K$-armed bandits, including instances with multiple optimal arms. For the mean-bonus variant, our result shows that stability can also be achieved through a direct optimistic shift of the posterior center, without inflating the posterior variance. The main technical novelty lies in the treatment of variance-inflated TS with multiple optimal arms. In this regime, stability requires proving that the randomized competition among statistically indistinguishable optimal arms converges to a deterministic allocation. 
We isolate a limiting pure-noise competition and prove a negative-feedback property: over-sampled optimal arms become less likely to win future posterior draws, while under-sampled ones become more likely to be selected. This yields a contraction toward the uniform allocation over the optimal set. Concentration and rare-event estimates then control the perturbations caused by empirical-mean errors and occasional suboptimal selections. For the mean-bonus variant, we use a separate argument based on posterior-sampling concentration and UCB-type comparisons, since optimism enters through a deterministic shift of the posterior mean rather than through variance inflation. These stability results imply asymptotically valid adaptive inference. In particular, for either optimistic TS variant, the usual studentized sample mean is asymptotically standard normal, and standard Wald confidence intervals achieve the nominal coverage probability despite adaptive sampling. Thus, suitably implemented optimism stabilizes Thompson sampling and enables classical inference from adaptively collected bandit data, while incurring only a mild additional regret cost.

APA

Yan, S. & Zhong, H. (2026). Optimism Stabilizes Thompson Sampling for Adaptive Inference. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:6885-6886. Available from https://proceedings.mlr.press/v336/yan26a.html.
