Smooth Non-stationary Bandits

Su Jia; Qian Xie; Nathan Kallus; Peter I. Frazier

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:14930-14944, 2023.

Abstract

In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, where they guarantee $T^{2/3}$ regret. However, in practice environments are often changing smoothly, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. In this paper, we study a non-stationary two-arm bandit problem where we assume an arm’s mean reward is a $\beta$-Hölder function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with $T^{3/5}$ regret for $\beta=2$. We complement this result by a $T^{\frac{\beta+1}{2\beta+1}}$ lower bound for any integer $\beta\ge 1$, which matches our upper bound for $\beta=2$.
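
As a quick sanity check on these exponents (a worked substitution into the lower-bound formula stated in the abstract, not an additional result from the paper), setting $\beta=1$ and $\beta=2$ gives:

$$
\left.\frac{\beta+1}{2\beta+1}\right|_{\beta=1} = \frac{2}{3},
\qquad
\left.\frac{\beta+1}{2\beta+1}\right|_{\beta=2} = \frac{3}{5}.
$$

So the $\beta=1$ (Lipschitz) case gives the same $T^{2/3}$ rate quoted for non-smooth changes, while $\beta=2$ gives $T^{3/5}$, the rate matched by the upper bound. Since $2/3 > 3/5$, the smooth regime admits strictly lower regret, and the exponent decreases toward $1/2$ as $\beta$ grows.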

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-jia23c,
  title = 	 {Smooth Non-stationary Bandits},
  author =       {Jia, Su and Xie, Qian and Kallus, Nathan and Frazier, Peter I.},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {14930--14944},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/jia23c/jia23c.pdf},
  url = 	 {https://proceedings.mlr.press/v202/jia23c.html},
  abstract = 	 {In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, where they guarantee $T^{2/3}$ regret. However, in practice environments are often changing smoothly, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. In this paper, we study a non-stationary two-arm bandit problem where we assume an arm’s mean reward is a $\beta$-Hölder function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with $T^{3/5}$ regret for $\beta=2$. We complement this result by a $T^{\frac{\beta+1}{2\beta+1}}$ lower bound for any integer $\beta\ge 1$, which matches our upper bound for $\beta=2$.}
}

Endnote

%0 Conference Paper
%T Smooth Non-stationary Bandits
%A Su Jia
%A Qian Xie
%A Nathan Kallus
%A Peter I. Frazier
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-jia23c
%I PMLR
%P 14930--14944
%U https://proceedings.mlr.press/v202/jia23c.html
%V 202
%X In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, where they guarantee $T^{2/3}$ regret. However, in practice environments are often changing smoothly, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. In this paper, we study a non-stationary two-arm bandit problem where we assume an arm’s mean reward is a $\beta$-Hölder function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with $T^{3/5}$ regret for $\beta=2$. We complement this result by a $T^{\frac{\beta+1}{2\beta+1}}$ lower bound for any integer $\beta\ge 1$, which matches our upper bound for $\beta=2$.

APA


Jia, S., Xie, Q., Kallus, N. & Frazier, P.I. (2023). Smooth Non-stationary Bandits. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:14930-14944. Available from https://proceedings.mlr.press/v202/jia23c.html.
