Rotting Infinitely Many-Armed Bandits

Jung-Hun Kim, Milan Vojnovic, Se-Young Yun
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:11229-11254, 2022.

Abstract

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate ϱ=o(1). We show that this learning problem has an Ω(max worst-case regret lower bound where T is the time horizon. We show that a matching upper bound \tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\}), up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate \varrho. We also show that an \tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\}) regret upper bound can be achieved by an algorithm that does not know the value of \varrho, by using an adaptive UCB index along with an adaptive threshold value.

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-kim22j, title = {Rotting Infinitely Many-Armed Bandits}, author = {Kim, Jung-Hun and Vojnovic, Milan and Yun, Se-Young}, booktitle = {Proceedings of the 39th International Conference on Machine Learning}, pages = {11229--11254}, year = {2022}, editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan}, volume = {162}, series = {Proceedings of Machine Learning Research}, month = {17--23 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v162/kim22j/kim22j.pdf}, url = {https://proceedings.mlr.press/v162/kim22j.html}, abstract = {We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound where $T$ is the time horizon. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.} }
Endnote
%0 Conference Paper %T Rotting Infinitely Many-Armed Bandits %A Jung-Hun Kim %A Milan Vojnovic %A Se-Young Yun %B Proceedings of the 39th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2022 %E Kamalika Chaudhuri %E Stefanie Jegelka %E Le Song %E Csaba Szepesvari %E Gang Niu %E Sivan Sabato %F pmlr-v162-kim22j %I PMLR %P 11229--11254 %U https://proceedings.mlr.press/v162/kim22j.html %V 162 %X We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound where $T$ is the time horizon. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
APA
Kim, J., Vojnovic, M. & Yun, S.. (2022). Rotting Infinitely Many-Armed Bandits. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:11229-11254 Available from https://proceedings.mlr.press/v162/kim22j.html.

Related Material