A Simple Approach for Nonstationary Linear Bandits
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:746-755, 2020.
Abstract
This paper investigates the problem of non-stationary linear bandits, where the unknown regression parameter evolves over time. Previous studies have adopted sophisticated mechanisms, such as sliding windows or weighted penalties, to achieve near-optimal dynamic regret. In this paper, we demonstrate that a simple restarted strategy is sufficient to attain the same regret guarantee. Specifically, we design a UCB-type algorithm to balance exploitation and exploration, and restart it periodically to handle the drift of the unknown parameter. Let $T$ be the time horizon, $d$ the dimension, and $P_T$ the path-length that measures the fluctuation of the evolving unknown parameter; our approach enjoys an $\tilde{O}(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ dynamic regret, which is nearly optimal, matching the $\Omega(d^{2/3}(1+P_T)^{1/3}T^{2/3})$ minimax lower bound up to logarithmic factors. Empirical studies also validate the efficacy of our approach.
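The restarted strategy described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the names (`restarted_linucb`, `get_arms`, `get_reward`), the confidence radius `beta`, and the restart period `H` are assumptions for the sketch; the paper's analysis ties the restart period to $T$, $d$, and $P_T$ to obtain the stated regret bound.

```python
import numpy as np

def restarted_linucb(get_arms, get_reward, T, d, H, lam=1.0, beta=1.0):
    """UCB-type linear bandit learner, restarted every H rounds (a sketch).

    get_arms(t)      -> array of shape (K, d): the arm set at round t
    get_reward(t, x) -> float: noisy reward for pulling arm x at round t
    H                -> restart period (hypothetical tuning parameter)
    """
    V = lam * np.eye(d)   # regularized design matrix
    b = np.zeros(d)       # sum of reward-weighted contexts
    chosen = []
    for t in range(T):
        if t > 0 and t % H == 0:
            # Periodic restart: discard past data to track parameter drift.
            V = lam * np.eye(d)
            b = np.zeros(d)
        theta_hat = np.linalg.solve(V, b)   # ridge regression estimate
        arms = get_arms(t)
        V_inv = np.linalg.inv(V)
        # Optimistic index: estimated reward plus an exploration bonus
        # proportional to the Mahalanobis norm ||x||_{V^{-1}}.
        bonus = np.sqrt(np.einsum('kd,de,ke->k', arms, V_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + beta * bonus))]
        r = get_reward(t, x)
        V += np.outer(x, x)
        b += r * x
        chosen.append(x)
    return np.array(chosen)

# Toy usage: a slowly drifting parameter (illustrative only).
rng = np.random.default_rng(0)
d, K, T, H = 3, 5, 200, 50
arm_sets = rng.normal(size=(T, K, d))
theta_star = lambda t: np.array([np.cos(t / 100.0), np.sin(t / 100.0), 0.5])
picks = restarted_linucb(
    get_arms=lambda t: arm_sets[t],
    get_reward=lambda t, x: float(x @ theta_star(t) + 0.1 * rng.normal()),
    T=T, d=d, H=H,
)
```

Between restarts the learner behaves like a standard optimistic (LinUCB-style) algorithm; the only non-stationarity mechanism is forgetting everything every `H` rounds, which is the simplicity the paper argues for.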