Rotting bandits are no harder than stochastic ones

Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, Michal Valko
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:2564-2572, 2019.

Abstract

In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering on expanding window average (FEWA) algorithm that constructs moving averages of increasing windows to identify arms that are more likely to return high rewards when pulled once more. We prove that for an unknown horizon T, and without any knowledge on the decreasing behavior of the K arms, FEWA achieves a problem-dependent regret bound of $O(\log(KT))$ and a problem-independent one of $O(\sqrt{KT})$. Our result substantially improves over the algorithm of Levine et al. (2017), which suffers regret $O(K^{1/3} T^{2/3})$. FEWA also matches known bounds for the stochastic bandit setting, thus showing that the rotting bandits are not harder. Finally, we report simulations confirming the theoretical improvements of FEWA.
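To make the expanding-window filtering idea concrete, below is a minimal Python sketch of one FEWA decision step. It is an illustrative reading of the abstract, not the authors' reference implementation: the function name fewa_round, the confidence-width constants, and the tie-breaking rule are assumptions, and rewards are assumed sigma-sub-Gaussian with each arm already pulled at least once.

    import numpy as np

    def fewa_round(reward_histories, sigma=1.0, alpha=4.0, t=2):
        # One FEWA decision step: filter arms on expanding-window averages and
        # return the index of the arm to pull next.
        #
        # reward_histories : list of 1-D numpy arrays, rewards observed so far
        #                    for each arm (each arm assumed pulled at least once).
        # sigma, alpha, t  : noise scale, exploration constant, current round;
        #                    used only in the (placeholder) confidence width.
        active = set(range(len(reward_histories)))
        h = 1
        while True:
            # Average of the h most recent rewards of every still-active arm
            # (by construction every active arm has at least h samples here).
            means_h = {i: reward_histories[i][-h:].mean() for i in active}
            # Confidence width for a window of size h; the constants stand in
            # for the paper's c(h, delta_t) and are not the exact ones used there.
            c_h = sigma * np.sqrt(alpha * np.log(t) / h)
            best = max(means_h.values())
            # Filter: discard arms whose window-h average is far below the best.
            active = {i for i in active if means_h[i] >= best - 2.0 * c_h}
            # If a surviving arm has only h samples, its window cannot be
            # expanded further, so pull it (ties broken by index here).
            for i in sorted(active):
                if len(reward_histories[i]) <= h:
                    return i
            h += 1

A full run would initialize each arm with one pull, then repeatedly call fewa_round on the current histories and append the observed (possibly decaying) reward to the chosen arm's history.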

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-seznec19a, title = {Rotting bandits are no harder than stochastic ones}, author = {Seznec, Julien and Locatelli, Andrea and Carpentier, Alexandra and Lazaric, Alessandro and Valko, Michal}, booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics}, pages = {2564--2572}, year = {2019}, editor = {Chaudhuri, Kamalika and Sugiyama, Masashi}, volume = {89}, series = {Proceedings of Machine Learning Research}, month = {16--18 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v89/seznec19a/seznec19a.pdf}, url = {https://proceedings.mlr.press/v89/seznec19a.html}, abstract = {In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering on expanding window average (FEWA) algorithm that constructs moving averages of increasing windows to identify arms that are more likely to return high rewards when pulled once more. We prove that for an unknown horizon T, and without any knowledge on the decreasing behavior of the K arms, FEWA achieves a problem-dependent regret bound of $O(\log(KT))$ and a problem-independent one of $O(\sqrt{KT})$. Our result substantially improves over the algorithm of Levine et al. (2017), which suffers regret $O(K^{1/3} T^{2/3})$. FEWA also matches known bounds for the stochastic bandit setting, thus showing that the rotting bandits are not harder. Finally, we report simulations confirming the theoretical improvements of FEWA.} }
Endnote
%0 Conference Paper %T Rotting bandits are no harder than stochastic ones %A Julien Seznec %A Andrea Locatelli %A Alexandra Carpentier %A Alessandro Lazaric %A Michal Valko %B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2019 %E Kamalika Chaudhuri %E Masashi Sugiyama %F pmlr-v89-seznec19a %I PMLR %P 2564--2572 %U https://proceedings.mlr.press/v89/seznec19a.html %V 89 %X In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering on expanding window average (FEWA) algorithm that constructs moving averages of increasing windows to identify arms that are more likely to return high rewards when pulled once more. We prove that for an unknown horizon T, and without any knowledge on the decreasing behavior of the K arms, FEWA achieves a problem-dependent regret bound of $O(\log(KT))$ and a problem-independent one of $O(\sqrt{KT})$. Our result substantially improves over the algorithm of Levine et al. (2017), which suffers regret $O(K^{1/3} T^{2/3})$. FEWA also matches known bounds for the stochastic bandit setting, thus showing that the rotting bandits are not harder. Finally, we report simulations confirming the theoretical improvements of FEWA.
APA
Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A. & Valko, M. (2019). Rotting bandits are no harder than stochastic ones. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:2564-2572. Available from https://proceedings.mlr.press/v89/seznec19a.html.