Randomized Exploration for Non-Stationary Stochastic Linear Bandits

Baekjin Kim; Ambuj Tewari

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR 124:71-80, 2020.

Abstract

We investigate two perturbation approaches to overcome the conservatism that optimism-based algorithms chronically suffer from in practice. The first approach replaces optimism with a simple randomization when using confidence sets. The second adds random perturbations to the current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), via the two perturbation approaches. We highlight the trade-off between statistical optimality and computational efficiency: the former asymptotically achieves the optimal dynamic regret $\tilde{O}(d^{2/3}B_T^{1/3}T^{2/3})$, whereas the latter is oracle-efficient but incurs an extra logarithmic factor in the number of arms compared to the minimax-optimal dynamic regret. In a simulation study, both algorithms show outstanding performance in tackling the conservatism issue that Discounted LinUCB struggles with.
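The second perturbation approach (perturb the current estimate, then maximize the expected reward) can be illustrated with a minimal sketch. This is not the paper's exact D-LinTS: the function names, the discounted update with factor `gamma` and regularizer `lam`, and the Gaussian perturbation with covariance `noise_scale**2 * V^{-1}` are all assumptions chosen for illustration.

```python
import numpy as np


def discounted_update(V, b, x, reward, gamma, lam):
    """Discounted ridge-regression statistics after observing (x, reward).

    Assumed form: past data is down-weighted by gamma each round, and the
    (1 - gamma) * lam term keeps the regularizer at roughly lam * I.
    """
    d = V.shape[0]
    V_new = gamma * V + np.outer(x, x) + (1.0 - gamma) * lam * np.eye(d)
    b_new = gamma * b + reward * x
    return V_new, b_new


def perturbed_greedy_action(V, b, actions, noise_scale, rng):
    """Perturb the current estimate, then pick the action maximizing reward."""
    theta_hat = np.linalg.solve(V, b)  # discounted least-squares estimate
    # Gaussian perturbation with covariance noise_scale**2 * V^{-1} (assumption).
    chol = np.linalg.cholesky(np.linalg.inv(V))
    z = rng.standard_normal(V.shape[0])
    theta_tilde = theta_hat + noise_scale * chol @ z
    # Greedy step with respect to the perturbed parameter.
    return int(np.argmax(actions @ theta_tilde)), theta_tilde


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, lam, gamma = 3, 1.0, 0.95
    actions = rng.standard_normal((5, d))      # hypothetical arm features
    theta_true = np.array([1.0, -0.5, 0.2])    # hypothetical parameter
    V, b = lam * np.eye(d), np.zeros(d)
    for t in range(200):
        theta_true = theta_true + 0.01 * rng.standard_normal(d)  # slow drift
        arm, _ = perturbed_greedy_action(V, b, actions, 0.5, rng)
        reward = actions[arm] @ theta_true + 0.1 * rng.standard_normal()
        V, b = discounted_update(V, b, actions[arm], reward, gamma, lam)
    print("estimate:", np.linalg.solve(V, b))
```

The greedy step only needs an argmax over the action set for a single sampled parameter, which is the sense in which such a scheme can rely on an optimization oracle; the values of `gamma` and `noise_scale` here are arbitrary and not tuned to any regret guarantee.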

Cite this Paper

BibTeX

@InProceedings{pmlr-v124-kim20a,
  title = 	 {Randomized Exploration for Non-Stationary Stochastic Linear Bandits},
  author =       {Kim, Baekjin and Tewari, Ambuj},
  booktitle = 	 {Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)},
  pages = 	 {71--80},
  year = 	 {2020},
  editor = 	 {Peters, Jonas and Sontag, David},
  volume = 	 {124},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {03--06 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v124/kim20a/kim20a.pdf},
  url = 	 {https://proceedings.mlr.press/v124/kim20a.html},
  abstract = 	 {We investigate two perturbation approaches to overcome conservatism that optimism based algorithms chronically suffer from in practice. The first approach replaces optimism with a simple randomization when using confidence sets. The second one adds random perturbations to its current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS) via the two perturbation approaches. We highlight the statistical optimality versus computational efficiency trade-off between them in that the former asymptotically achieves the optimal dynamic regret $\tilde{O}(d ^{2/3}B_T^{1/3} T^{2/3})$, but the latter is oracle-efficient with an extra logarithmic factor in the number of arms compared to minimax-optimal dynamic regret. In a simulation study, both algorithms show the outstanding performance in tackling conservatism issue that Discounted LinUCB struggles with.}
}

Endnote

%0 Conference Paper
%T Randomized Exploration for Non-Stationary Stochastic Linear Bandits
%A Baekjin Kim
%A Ambuj Tewari
%B Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)
%C Proceedings of Machine Learning Research
%D 2020
%E Jonas Peters
%E David Sontag
%F pmlr-v124-kim20a
%I PMLR
%P 71--80
%U https://proceedings.mlr.press/v124/kim20a.html
%V 124
%X We investigate two perturbation approaches to overcome conservatism that optimism based algorithms chronically suffer from in practice. The first approach replaces optimism with a simple randomization when using confidence sets. The second one adds random perturbations to its current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS) via the two perturbation approaches. We highlight the statistical optimality versus computational efficiency trade-off between them in that the former asymptotically achieves the optimal dynamic regret $\tilde{O}(d ^{2/3}B_T^{1/3} T^{2/3})$, but the latter is oracle-efficient with an extra logarithmic factor in the number of arms compared to minimax-optimal dynamic regret. In a simulation study, both algorithms show the outstanding performance in tackling conservatism issue that Discounted LinUCB struggles with.

APA

Kim, B. & Tewari, A. (2020). Randomized Exploration for Non-Stationary Stochastic Linear Bandits. Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), in Proceedings of Machine Learning Research 124:71-80. Available from https://proceedings.mlr.press/v124/kim20a.html.
