Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits

Tianyuan Jin; Jing Tang; Pan Xu; Keke Huang; Xiaokui Xiao; Quanquan Gu

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5065-5073, 2021.

Abstract

In batched multi-armed bandit problems, the learner can adaptively pull arms and adjust strategy in batches. In many real applications, not only the regret but also the batch complexity needs to be optimized. Existing batched bandit algorithms usually assume that the time horizon $T$ is known in advance. However, many applications involve an unpredictable stopping time. In this paper, we study the anytime batched multi-armed bandit problem. We propose an anytime algorithm that achieves the asymptotically optimal regret for exponential families of reward distributions with $O(\log\log T \, \mathrm{ilog}^{\alpha}(T))$ batches, where $\alpha \in O_{T}(1)$. Here $\mathrm{ilog}^{\alpha}(T)$ denotes the result of iteratively applying the logarithm function to $T$ for $\alpha$ times, e.g., $\mathrm{ilog}^{3}(T) = \log\log\log T$. Moreover, we prove that for any constant $c > 0$, no algorithm can achieve the asymptotically optimal regret within $c \log\log T$ batches.
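
The iterated-logarithm notation in the abstract can be made concrete with a small numerical sketch. This is only an illustration of the notation, not the paper's algorithm: the function names `ilog` and `batch_bound_shape` are hypothetical, the natural logarithm is an assumption (the abstract does not fix a base), and the constant hidden by the $O(\cdot)$ is ignored.

```python
import math


def ilog(T: float, alpha: int) -> float:
    """Apply the logarithm to T iteratively, alpha times.

    Assumption: natural logarithm; the abstract does not specify a base.
    ilog(T, 0) returns T unchanged; ilog(T, 3) equals log(log(log(T))).
    """
    if alpha < 0:
        raise ValueError("alpha must be a non-negative integer")
    value = float(T)
    for _ in range(alpha):
        if value <= 0:
            raise ValueError("T is too small to apply the logarithm this many times")
        value = math.log(value)
    return value


def batch_bound_shape(T: float, alpha: int) -> float:
    """Growth term log log T * ilog^alpha(T), without the hidden O(.) constant.

    Hypothetical helper: it only shows how the expression scales with T.
    """
    return ilog(T, 2) * ilog(T, alpha)


if __name__ == "__main__":
    # Compare the growth term with plain log log T for a few horizons.
    for T in (1e4, 1e8, 1e16):
        print(f"T={T:.0e}  loglogT={ilog(T, 2):.3f}  bound_shape={batch_bound_shape(T, 3):.3f}")
```

Under these assumptions, the extra factor $\mathrm{ilog}^{\alpha}(T)$ grows very slowly in $T$ for fixed $\alpha \ge 2$, which is why the stated batch complexity stays close to the $\log\log T$ scale.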

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-jin21c,
  title = 	 {Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits},
  author =       {Jin, Tianyuan and Tang, Jing and Xu, Pan and Huang, Keke and Xiao, Xiaokui and Gu, Quanquan},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {5065--5073},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/jin21c/jin21c.pdf},
  url = 	 {https://proceedings.mlr.press/v139/jin21c.html},
  abstract = 	 {In batched multi-armed bandit problems, the learner can adaptively pull arms and adjust strategy in batches. In many real applications, not only the regret but also the batch complexity needs to be optimized. Existing batched bandit algorithms usually assume that the time horizon $T$ is known in advance. However, many applications involve an unpredictable stopping time. In this paper, we study the anytime batched multi-armed bandit problem. We propose an anytime algorithm that achieves the asymptotically optimal regret for exponential families of reward distributions with $O(\log\log T \, \mathrm{ilog}^{\alpha}(T))$ batches, where $\alpha \in O_{T}(1)$. Here $\mathrm{ilog}^{\alpha}(T)$ denotes the result of iteratively applying the logarithm function to $T$ for $\alpha$ times, e.g., $\mathrm{ilog}^{3}(T) = \log\log\log T$. Moreover, we prove that for any constant $c > 0$, no algorithm can achieve the asymptotically optimal regret within $c \log\log T$ batches.}
}

Endnote

%0 Conference Paper
%T Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits
%A Tianyuan Jin
%A Jing Tang
%A Pan Xu
%A Keke Huang
%A Xiaokui Xiao
%A Quanquan Gu
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-jin21c
%I PMLR
%P 5065--5073
%U https://proceedings.mlr.press/v139/jin21c.html
%V 139
%X In batched multi-armed bandit problems, the learner can adaptively pull arms and adjust strategy in batches. In many real applications, not only the regret but also the batch complexity needs to be optimized. Existing batched bandit algorithms usually assume that the time horizon $T$ is known in advance. However, many applications involve an unpredictable stopping time. In this paper, we study the anytime batched multi-armed bandit problem. We propose an anytime algorithm that achieves the asymptotically optimal regret for exponential families of reward distributions with $O(\log\log T \, \mathrm{ilog}^{\alpha}(T))$ batches, where $\alpha \in O_{T}(1)$. Here $\mathrm{ilog}^{\alpha}(T)$ denotes the result of iteratively applying the logarithm function to $T$ for $\alpha$ times, e.g., $\mathrm{ilog}^{3}(T) = \log\log\log T$. Moreover, we prove that for any constant $c > 0$, no algorithm can achieve the asymptotically optimal regret within $c \log\log T$ batches.

APA

Jin, T., Tang, J., Xu, P., Huang, K., Xiao, X. & Gu, Q. (2021). Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5065-5073. Available from https://proceedings.mlr.press/v139/jin21c.html.
