A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free

Yifang Chen; Chung-Wei Lee; Haipeng Luo; Chen-Yu Wei

A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free

Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei

Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:696-726, 2019.

Abstract

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves $\mathcal{O}(\min\{\sqrt{KST}, K^{\frac{1}{3}}\Delta ^{\frac{1}{3}}T^{\frac{2}{3}}\})$ dynamic regret for a contextual bandit problem with $T$ rounds, $K$ actions, $S$ switches and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O} (\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce {\it replay phases}, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.

Cite this Paper

BibTeX

@InProceedings{pmlr-v99-chen19b,
  title = 	 {A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free},
  author =       {Chen, Yifang and Lee, Chung-Wei and Luo, Haipeng and Wei, Chen-Yu},
  booktitle = 	 {Proceedings of the Thirty-Second Conference on Learning Theory},
  pages = 	 {696--726},
  year = 	 {2019},
  editor = 	 {Beygelzimer, Alina and Hsu, Daniel},
  volume = 	 {99},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--28 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v99/chen19b/chen19b.pdf},
  url = 	 {https://proceedings.mlr.press/v99/chen19b.html},
  abstract = 	 {We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves $\mathcal{O}(\min\{\sqrt{KST}, K^{\frac{1}{3}}\Delta ^{\frac{1}{3}}T^{\frac{2}{3}}\})$ dynamic regret for a contextual bandit problem with $T$ rounds, $K$ actions, $S$ switches and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O} (\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce {\it replay phases}, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.}
}

Endnote

%0 Conference Paper
%T A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free
%A Yifang Chen
%A Chung-Wei Lee
%A Haipeng Luo
%A Chen-Yu Wei
%B Proceedings of the Thirty-Second Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2019
%E Alina Beygelzimer
%E Daniel Hsu	
%F pmlr-v99-chen19b
%I PMLR
%P 696--726
%U https://proceedings.mlr.press/v99/chen19b.html
%V 99
%X We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves $\mathcal{O}(\min\{\sqrt{KST}, K^{\frac{1}{3}}\Delta ^{\frac{1}{3}}T^{\frac{2}{3}}\})$ dynamic regret for a contextual bandit problem with $T$ rounds, $K$ actions, $S$ switches and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O} (\min \{S^{\frac{1}{4}}T^{\frac{3}{4}}, \Delta^{\frac{1}{5}}T^{\frac{4}{5}}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce {\it replay phases}, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.

APA

Chen, Y., Lee, C., Luo, H. & Wei, C.. (2019). A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free. Proceedings of the Thirty-Second Conference on Learning Theory, in Proceedings of Machine Learning Research 99:696-726 Available from https://proceedings.mlr.press/v99/chen19b.html.

Related Material

Download PDF