Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit

Yang Cao; Zheng Wen; Branislav Kveton; Yao Xie

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:418-427, 2019.

Abstract

Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting arms to pull, each with an unknown reward distribution. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by combining a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve a nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset and a real-world digital marketing dataset to demonstrate its superior performance.
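
The idea of pairing UCB with a change detector can be illustrated with a minimal Python sketch. It is not the paper's M-UCB specification: the function names, the two-half window test, the periodic forced-exploration sweep, and all parameter values (`window_size`, `threshold`, `explore_period`) are assumptions chosen for illustration, and `demo_pull` is a hypothetical noise-free environment whose best arm switches at step 500.

```python
import math
from collections import deque


def change_detected(window, threshold):
    """Flag a change when the sums of the older and newer halves of the
    window differ by more than `threshold` (an assumed, simple test)."""
    samples = list(window)
    half = len(samples) // 2
    return abs(sum(samples[half:]) - sum(samples[:half])) > threshold


def ucb_with_change_detection(pull, n_arms, horizon, window_size=20,
                              threshold=4.5, explore_period=20):
    """UCB1 that restarts from scratch whenever any arm's recent rewards
    look shifted. All default parameter values are illustrative."""
    if explore_period <= n_arms:
        raise ValueError("explore_period must exceed n_arms")

    counts = [0] * n_arms          # pulls per arm since the last restart
    sums = [0.0] * n_arms          # reward totals since the last restart
    windows = [deque(maxlen=window_size) for _ in range(n_arms)]
    last_restart = 0
    restarts = []                  # time steps at which a change was flagged

    for t in range(horizon):
        since = t - last_restart
        slot = since % explore_period
        if slot < n_arms:
            # Forced sweep: every arm keeps getting sampled, so a change
            # on a currently unattractive arm can still be noticed.
            arm = slot
        else:
            # Standard UCB1 index on the statistics since the last restart.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(since + 1) / counts[a]),
            )

        reward = pull(arm, t)
        counts[arm] += 1
        sums[arm] += reward
        windows[arm].append(reward)

        full = len(windows[arm]) == window_size
        if full and change_detected(windows[arm], threshold):
            # Discard everything learned so far and start UCB afresh.
            restarts.append(t)
            counts = [0] * n_arms
            sums = [0.0] * n_arms
            for w in windows:
                w.clear()
            last_restart = t + 1

    return restarts, counts


def demo_pull(arm, t):
    """Hypothetical deterministic environment: the best arm flips at t=500."""
    best = 0 if t < 500 else 1
    return 0.9 if arm == best else 0.1


restarts, counts = ucb_with_change_detection(demo_pull, n_arms=2, horizon=1000)
print(restarts, counts)
```

Because the demo rewards carry no noise, the detector never fires inside a stationary segment; with stochastic rewards the threshold would have to be tuned against the window size to trade false alarms against detection delay.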

Cite this Paper

BibTeX


@InProceedings{pmlr-v89-cao19a,
  title = 	 {Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit},
  author =       {Cao, Yang and Wen, Zheng and Kveton, Branislav and Xie, Yao},
  booktitle = 	 {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages = 	 {418--427},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume = 	 {89},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {16--18 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v89/cao19a/cao19a.pdf},
  url = 	 {https://proceedings.mlr.press/v89/cao19a.html},
  abstract = 	 {Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments.  Comparison with the best available lower bound shows that our M-UCB is nearly optimal in $T$ up to a logarithmic factor.  We also compare M-UCB with the state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset and a real-world digital marketing dataset to demonstrate its superior performance.}
}

Endnote

%0 Conference Paper
%T Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit
%A Yang Cao
%A Zheng Wen
%A Branislav Kveton
%A Yao Xie
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-cao19a
%I PMLR
%P 418--427
%U https://proceedings.mlr.press/v89/cao19a.html
%V 89
%X Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments.  Comparison with the best available lower bound shows that our M-UCB is nearly optimal in $T$ up to a logarithmic factor.  We also compare M-UCB with the state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset and a real-world digital marketing dataset to demonstrate its superior performance.

APA


Cao, Y., Wen, Z., Kveton, B., & Xie, Y. (2019). Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:418-427. Available from https://proceedings.mlr.press/v89/cao19a.html.
