MOTS: Minimax Optimal Thompson Sampling

Tianyuan Jin; Pan Xu; Jieming Shi; Xiaokui Xiao; Quanquan Gu

MOTS: Minimax Optimal Thompson Sampling

Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, Quanquan Gu

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5074-5083, 2021.

Abstract

Thompson sampling is one of the most widely used algorithms in many online decision problems due to its simplicity for implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can achieve the minimax optimal regret O(\sqrt{TK}) for K-armed bandit problems, where T is the total time horizon. In this paper we fill this long open gap by proposing a new Thompson sampling algorithm called MOTS that adaptively truncates the sampling result of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound O(\sqrt{TK}) for finite time horizon T and also the asymptotic optimal regret bound when $T$ grows to infinity as well. This is the first time that the minimax optimality of multi-armed bandit problems has been attained by Thompson sampling type of algorithms.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-jin21d,
  title = 	 {MOTS: Minimax Optimal Thompson Sampling},
  author =       {Jin, Tianyuan and Xu, Pan and Shi, Jieming and Xiao, Xiaokui and Gu, Quanquan},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {5074--5083},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/jin21d/jin21d.pdf},
  url = 	 {https://proceedings.mlr.press/v139/jin21d.html},
  abstract = 	 {Thompson sampling is one of the most widely used algorithms in many online decision problems due to its simplicity for implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can achieve the minimax optimal regret O(\sqrt{TK}) for K-armed bandit problems, where T is the total time horizon. In this paper we fill this long open gap by proposing a new Thompson sampling algorithm called MOTS that adaptively truncates the sampling result of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound O(\sqrt{TK}) for finite time horizon T and also the asymptotic optimal regret bound when $T$ grows to infinity as well. This is the first time that the minimax optimality of multi-armed bandit problems has been attained by Thompson sampling type of algorithms.}
}

Endnote

%0 Conference Paper
%T MOTS: Minimax Optimal Thompson Sampling
%A Tianyuan Jin
%A Pan Xu
%A Jieming Shi
%A Xiaokui Xiao
%A Quanquan Gu
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang	
%F pmlr-v139-jin21d
%I PMLR
%P 5074--5083
%U https://proceedings.mlr.press/v139/jin21d.html
%V 139
%X Thompson sampling is one of the most widely used algorithms in many online decision problems due to its simplicity for implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can achieve the minimax optimal regret O(\sqrt{TK}) for K-armed bandit problems, where T is the total time horizon. In this paper we fill this long open gap by proposing a new Thompson sampling algorithm called MOTS that adaptively truncates the sampling result of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound O(\sqrt{TK}) for finite time horizon T and also the asymptotic optimal regret bound when $T$ grows to infinity as well. This is the first time that the minimax optimality of multi-armed bandit problems has been attained by Thompson sampling type of algorithms.

APA

Jin, T., Xu, P., Shi, J., Xiao, X. & Gu, Q.. (2021). MOTS: Minimax Optimal Thompson Sampling. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5074-5083 Available from https://proceedings.mlr.press/v139/jin21d.html.

MOTS: Minimax Optimal Thompson Sampling

Abstract

Cite this Paper

Related Material