Bayesian Change-Point Detection for Bandit Feedback in Non-stationary Environments

Reda Alami
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:17-31, 2023.

Abstract

The stochastic multi-armed bandit problem has been widely studied under the stationarity assumption. However, in real-world problems and industrial applications, this assumption is often unrealistic because the reward distributions may change over time. In this paper, we consider the piecewise-i.i.d. non-stationary stochastic multi-armed bandit problem with unknown change-points, focusing on the change-of-mean setup. To solve it, we propose a change-point-based framework in which we study a class of change-detection-based optimal bandit policies that actively detect change-points using the restarted Bayesian online change-point detector and then restart the bandit indices. Analytically, in the context of regret minimization, our proposal achieves a $\mathcal{O}(\sqrt{A T K_T})$ regret upper bound, where $K_T$ is the overall number of change-points up to the horizon $T$ and $A$ is the number of arms. The derived bound matches the existing lower bound for abruptly changing environments. Finally, we demonstrate the cumulative regret reduction of our proposal on synthetic Bernoulli rewards as well as Yahoo! datasets of webpage click-through rates.
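To make the detect-and-restart framework concrete, below is a minimal Python sketch (not taken from the paper) that pairs a per-arm Bayesian online change-point detector for Bernoulli rewards with a standard UCB1 index, restarting an arm's statistics whenever its detector raises an alarm. The names `BernoulliBOCD` and `restarted_ucb`, the constant hazard rate, the Beta(1, 1) prior, and the alarm rule are all illustrative assumptions; the paper's actual restarted detector and its tuning differ.

```python
import numpy as np

class BernoulliBOCD:
    """A minimal Bayesian online change-point detector for Bernoulli
    observations. This is a simplified stand-in for the paper's restarted
    Bayesian online change-point detector: the constant hazard rate,
    Beta(1, 1) prior, and alarm rule are illustrative assumptions."""

    def __init__(self, hazard=0.01, warmup=10):
        self.hazard = hazard  # prior probability of a change at any step
        self.warmup = warmup  # observations required before raising alarms
        self.reset()

    def reset(self):
        self.t = 0
        # run_probs[r] = posterior probability that the current run length is r
        self.run_probs = np.array([1.0])
        # Beta posterior parameters, one pair per run-length hypothesis
        self.alphas = np.array([1.0])
        self.betas = np.array([1.0])

    def update(self, x):
        """Ingest one {0, 1} observation; return True when a change is likely."""
        self.t += 1
        # predictive probability of x under each run-length hypothesis
        p_one = self.alphas / (self.alphas + self.betas)
        pred = p_one if x == 1 else 1.0 - p_one
        growth = self.run_probs * pred * (1.0 - self.hazard)   # run continues
        restart = np.sum(self.run_probs * pred * self.hazard)  # run restarts
        self.run_probs = np.append(restart, growth)
        self.run_probs /= self.run_probs.sum()
        # fresh prior for the new-run hypothesis, Bayesian update for the rest
        self.alphas = np.append(1.0, self.alphas + x)
        self.betas = np.append(1.0, self.betas + 1 - x)
        # alarm when the posterior mode points to a run much shorter than
        # the time elapsed since the last restart
        return self.t > self.warmup and np.argmax(self.run_probs) < self.t // 2

def restarted_ucb(pull, n_arms, horizon):
    """UCB1 with one detector per arm; when an arm's detector fires, that
    arm's statistics and detector are restarted. A sketch of the
    change-detection-plus-restart framework, not the paper's exact policy."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    detectors = [BernoulliBOCD() for _ in range(n_arms)]
    history = []
    for t in range(1, horizon + 1):
        if np.any(counts == 0):                     # play each arm once
            arm = int(np.argmin(counts))
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        x = pull(arm, t)                            # observe a {0, 1} reward
        counts[arm] += 1
        sums[arm] += x
        history.append((t, arm, x))
        if detectors[arm].update(x):                # change detected on this arm
            counts[arm] = sums[arm] = 0.0           # restart the bandit index
            detectors[arm].reset()
    return history

# Example: two Bernoulli arms whose means swap halfway through the horizon.
rng = np.random.default_rng(0)
means = lambda t: (0.2, 0.8) if t < 2500 else (0.8, 0.2)
history = restarted_ucb(lambda a, t: float(rng.random() < means(t)[a]), 2, 5000)
```

On the abrupt swap at $t = 2500$, the detector on the previously optimal arm eventually flags the drop in mean, the arm's count and sum are zeroed, and UCB1 re-explores it as if from scratch; this is the restart mechanism that, with the paper's detector and analysis, yields the $\mathcal{O}(\sqrt{A T K_T})$ regret bound.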

Cite this Paper


BibTeX
@InProceedings{pmlr-v189-alami23a,
  title     = {Bayesian Change-Point Detection for Bandit Feedback in Non-stationary Environments},
  author    = {Alami, Reda},
  booktitle = {Proceedings of The 14th Asian Conference on Machine Learning},
  pages     = {17--31},
  year      = {2023},
  editor    = {Khan, Emtiyaz and Gonen, Mehmet},
  volume    = {189},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--14 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v189/alami23a/alami23a.pdf},
  url       = {https://proceedings.mlr.press/v189/alami23a.html},
  abstract  = {The stochastic multi-armed bandit problem has been widely studied under the stationarity assumption. However, in real-world problems and industrial applications, this assumption is often unrealistic because the reward distributions may change over time. In this paper, we consider the piecewise-i.i.d. non-stationary stochastic multi-armed bandit problem with unknown change-points, focusing on the change-of-mean setup. To solve it, we propose a change-point-based framework in which we study a class of change-detection-based optimal bandit policies that actively detect change-points using the restarted Bayesian online change-point detector and then restart the bandit indices. Analytically, in the context of regret minimization, our proposal achieves a $\mathcal{O}(\sqrt{A T K_T})$ regret upper bound, where $K_T$ is the overall number of change-points up to the horizon $T$ and $A$ is the number of arms. The derived bound matches the existing lower bound for abruptly changing environments. Finally, we demonstrate the cumulative regret reduction of our proposal on synthetic Bernoulli rewards as well as Yahoo! datasets of webpage click-through rates.}
}
Endnote
%0 Conference Paper
%T Bayesian Change-Point Detection for Bandit Feedback in Non-stationary Environments
%A Reda Alami
%B Proceedings of The 14th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-alami23a
%I PMLR
%P 17--31
%U https://proceedings.mlr.press/v189/alami23a.html
%V 189
%X The stochastic multi-armed bandit problem has been widely studied under the stationarity assumption. However, in real-world problems and industrial applications, this assumption is often unrealistic because the reward distributions may change over time. In this paper, we consider the piecewise-i.i.d. non-stationary stochastic multi-armed bandit problem with unknown change-points, focusing on the change-of-mean setup. To solve it, we propose a change-point-based framework in which we study a class of change-detection-based optimal bandit policies that actively detect change-points using the restarted Bayesian online change-point detector and then restart the bandit indices. Analytically, in the context of regret minimization, our proposal achieves a $\mathcal{O}(\sqrt{A T K_T})$ regret upper bound, where $K_T$ is the overall number of change-points up to the horizon $T$ and $A$ is the number of arms. The derived bound matches the existing lower bound for abruptly changing environments. Finally, we demonstrate the cumulative regret reduction of our proposal on synthetic Bernoulli rewards as well as Yahoo! datasets of webpage click-through rates.
APA
Alami, R. (2023). Bayesian Change-Point Detection for Bandit Feedback in Non-stationary Environments. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:17-31. Available from https://proceedings.mlr.press/v189/alami23a.html.