Rotting Infinitely Many-Armed Bandits

Jung-Hun Kim; Milan Vojnovic; Se-Young Yun

Rotting Infinitely Many-Armed Bandits

Jung-Hun Kim, Milan Vojnovic, Se-Young Yun

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:11229-11254, 2022.

Abstract

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate

$\varrho=o(1)$ . We show that this learning problem has an

$\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound where

$T$ is the time horizon. We show that a matching upper bound

$\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$ , up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate

$\varrho$ . We also show that an

$\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of

$\varrho$ , by using an adaptive UCB index along with an adaptive threshold value.

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-kim22j,
  title = 	 {Rotting Infinitely Many-Armed Bandits},
  author =       {Kim, Jung-Hun and Vojnovic, Milan and Yun, Se-Young},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {11229--11254},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/kim22j/kim22j.pdf},
  url = 	 {https://proceedings.mlr.press/v162/kim22j.html},
  abstract = 	 {We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound where $T$ is the time horizon. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.}
}

Endnote

%0 Conference Paper
%T Rotting Infinitely Many-Armed Bandits
%A Jung-Hun Kim
%A Milan Vojnovic
%A Se-Young Yun
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato	
%F pmlr-v162-kim22j
%I PMLR
%P 11229--11254
%U https://proceedings.mlr.press/v162/kim22j.html
%V 162
%X We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound where $T$ is the time horizon. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.

APA


Kim, J., Vojnovic, M. & Yun, S.. (2022). Rotting Infinitely Many-Armed Bandits. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:11229-11254 Available from https://proceedings.mlr.press/v162/kim22j.html.

Related Material

Download PDF