Efficient Contextual Bandits in Non-stationary Worlds

Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, John Langford
Proceedings of the 31st Conference On Learning Theory, PMLR 75:1739-1776, 2018.

Abstract

Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $\mathcal{O}(\sqrt{ST})$ if there are $T$ rounds with $S$ stationary periods, or more generally $\mathcal{O}(\Delta^{1/3}T^{2/3})$ where $\Delta$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit. Furthermore, while the results above require tuning a parameter based on the unknown quantity $S$ or $\Delta$, we also develop a parameter free algorithm achieving regret $\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$. This improves and generalizes the best existing result $\Delta^{0.18}T^{0.82}$ by Karnin and Anava (2016) which only holds for the two-armed bandit problem.
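For reference, the regret notions named above are commonly defined as sketched below, using standard conventions for contexts $x_t$, chosen actions $a_t$, rewards $r_t$, and a policy class $\Pi$; the paper's exact formulations (in particular its non-stationarity measure $\Delta$) may differ.

Interval regret, against the best fixed policy on an interval $[s, e] \subseteq [1, T]$:
$\max_{\pi \in \Pi} \sum_{t=s}^{e} \big( r_t(\pi(x_t)) - r_t(a_t) \big)$

Switching regret, against any sequence of policies $\pi_1, \dots, \pi_T \in \Pi$ with at most $S-1$ switches:
$\max_{\pi_{1:T} : \sum_{t<T} \mathbf{1}[\pi_{t+1} \neq \pi_t] \le S-1} \; \sum_{t=1}^{T} \big( r_t(\pi_t(x_t)) - r_t(a_t) \big)$

Dynamic regret, against the best policy at each round:
$\sum_{t=1}^{T} \big( \max_{\pi \in \Pi} r_t(\pi(x_t)) - r_t(a_t) \big)$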

Cite this Paper


BibTeX
@InProceedings{pmlr-v75-luo18a, title = {Efficient Contextual Bandits in Non-stationary Worlds}, author = {Luo, Haipeng and Wei, Chen-Yu and Agarwal, Alekh and Langford, John}, booktitle = {Proceedings of the 31st Conference On Learning Theory}, pages = {1739--1776}, year = {2018}, editor = {Bubeck, Sébastien and Perchet, Vianney and Rigollet, Philippe}, volume = {75}, series = {Proceedings of Machine Learning Research}, month = {06--09 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v75/luo18a/luo18a.pdf}, url = {https://proceedings.mlr.press/v75/luo18a.html}, abstract = {Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $\mathcal{O}(\sqrt{ST})$ if there are $T$ rounds with $S$ stationary periods, or more generally $\mathcal{O}(\Delta^{1/3}T^{2/3})$ where $\Delta$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit. Furthermore, while the results above require tuning a parameter based on the unknown quantity $S$ or $\Delta$, we also develop a parameter free algorithm achieving regret $\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$. This improves and generalizes the best existing result $\Delta^{0.18}T^{0.82}$ by Karnin and Anava (2016) which only holds for the two-armed bandit problem.} }
Endnote
%0 Conference Paper %T Efficient Contextual Bandits in Non-stationary Worlds %A Haipeng Luo %A Chen-Yu Wei %A Alekh Agarwal %A John Langford %B Proceedings of the 31st Conference On Learning Theory %C Proceedings of Machine Learning Research %D 2018 %E Sébastien Bubeck %E Vianney Perchet %E Philippe Rigollet %F pmlr-v75-luo18a %I PMLR %P 1739--1776 %U https://proceedings.mlr.press/v75/luo18a.html %V 75 %X Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $\mathcal{O}(\sqrt{ST})$ if there are $T$ rounds with $S$ stationary periods, or more generally $\mathcal{O}(\Delta^{1/3}T^{2/3})$ where $\Delta$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit. Furthermore, while the results above require tuning a parameter based on the unknown quantity $S$ or $\Delta$, we also develop a parameter free algorithm achieving regret $\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$. This improves and generalizes the best existing result $\Delta^{0.18}T^{0.82}$ by Karnin and Anava (2016) which only holds for the two-armed bandit problem.
APA
Luo, H., Wei, C., Agarwal, A. & Langford, J. (2018). Efficient Contextual Bandits in Non-stationary Worlds. Proceedings of the 31st Conference On Learning Theory, in Proceedings of Machine Learning Research 75:1739-1776. Available from https://proceedings.mlr.press/v75/luo18a.html.
