Dynamic Planning and Learning under Recovering Rewards

David Simchi-Levi; Zeyu Zheng; Feng Zhu

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:9702-9711, 2021.

Abstract

Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then non-parametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over $T$ time periods, we propose, construct and prove performance guarantees for a class of “Purely Periodic Policies”. For the offline problem when all model parameters are known, our proposed policy obtains an approximation ratio that is at the order of $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal when $K$ grows to infinity. For the online problem when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that approximately has $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications with non-stationary and recovering rewards.
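Features (i) and (ii) can be made concrete with a small simulation. The sketch below is illustrative only and is not the paper's construction. It assumes that a "purely periodic" schedule pulls each arm at a fixed interval, and that the first pull of an arm earns the same reward as its later pulls. The names `periodic_policy_reward`, `recovery`, `periods` and `offsets` are hypothetical, and the linear recovery curve is invented for the example.

```python
from typing import Callable, List, Optional


def periodic_policy_reward(
    recovery: List[Callable[[int], float]],  # recovery[i](d): expected reward of arm i after d idle periods
    periods: List[int],                      # arm i is pulled once every periods[i] time steps
    offsets: List[int],                      # arm i is pulled when t % periods[i] == offsets[i]
    K: int,                                  # at most K arms may be pulled per time step
    T: int,                                  # horizon
) -> float:
    """Expected cumulative reward of a fixed periodic schedule (illustrative sketch)."""
    n_arms = len(recovery)
    last_pull: List[Optional[int]] = [None] * n_arms
    total = 0.0
    for t in range(T):
        due = [i for i in range(n_arms) if t % periods[i] == offsets[i]]
        if len(due) > K:
            raise ValueError(f"schedule pulls {len(due)} arms at t={t}, but K={K}")
        for i in due:
            # Assumption: before its first pull, an arm has been idle for one full period.
            idle = periods[i] if last_pull[i] is None else t - last_pull[i]
            total += recovery[i](idle)
            last_pull[i] = t
    return total


# Made-up recovery curve: reward grows linearly with idle time and is capped at 1.
recover = lambda d: min(1.0, 0.25 * d)

# Three arms, one pull per step: arm 0 every 2 steps, arms 1 and 2 every 4 steps.
total = periodic_policy_reward(
    recovery=[recover, recover, recover],
    periods=[2, 4, 4],
    offsets=[0, 1, 3],
    K=1,
    T=8,
)
print(total)
```

In this toy instance arm 0 is pulled twice as often as the other arms but earns half as much per pull, which is the trade-off between pulling frequency and recovered reward that the problem statement describes.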

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-simchi-levi21a,
  title = 	 {Dynamic Planning and Learning under Recovering Rewards},
  author =       {Simchi-Levi, David and Zheng, Zeyu and Zhu, Feng},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {9702--9711},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/simchi-levi21a/simchi-levi21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/simchi-levi21a.html},
  abstract = 	 {Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then non-parametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over $T$ time periods, we propose, construct and prove performance guarantees for a class of “Purely Periodic Policies”. For the offline problem when all model parameters are known, our proposed policy obtains an approximation ratio that is at the order of $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal when $K$ grows to infinity. For the online problem when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that approximately has $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications with non-stationary and recovering rewards.}
}

Endnote

%0 Conference Paper
%T Dynamic Planning and Learning under Recovering Rewards
%A David Simchi-Levi
%A Zeyu Zheng
%A Feng Zhu
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-simchi-levi21a
%I PMLR
%P 9702--9711
%U https://proceedings.mlr.press/v139/simchi-levi21a.html
%V 139
%X Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then non-parametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over $T$ time periods, we propose, construct and prove performance guarantees for a class of “Purely Periodic Policies”. For the offline problem when all model parameters are known, our proposed policy obtains an approximation ratio that is at the order of $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal when $K$ grows to infinity. For the online problem when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that approximately has $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications with non-stationary and recovering rewards.

APA


Simchi-Levi, D., Zheng, Z. & Zhu, F. (2021). Dynamic Planning and Learning under Recovering Rewards. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:9702-9711. Available from https://proceedings.mlr.press/v139/simchi-levi21a.html.
