Combinatorial semi-bandit in the non-stationary environment
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:865-875, 2021.
Abstract
In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general setting where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only an approximate offline oracle exists (Wang and Chen, NIPS 2017), our algorithm achieves $\tilde{O}(m\sqrt{N T}/\Delta_{\min})$ distribution-dependent regret in the switching case and $\tilde{O}(V^{1/3}T^{2/3})$ distribution-independent regret in the dynamic case, where $m$ is the total number of arms, $N$ is the number of distribution switches, $V$ is the total amount of “distribution change”, and $\Delta_{\min}$ is a gap variable that depends on the distributions of the arm outcomes. The regret bounds in both scenarios are nearly optimal, but the algorithm needs to know the parameter $N$ or $V$ in advance. We further show that, by employing a different technique, the algorithm can be freed from knowing $N$ or $V$, at the cost of possibly suboptimal regret bounds. In the special case where the reward function is linear and an exact oracle is available, we apply a new technique to design a parameter-free algorithm that achieves nearly optimal regret in both the switching case and the dynamic case without knowing the parameters in advance.
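As a concrete illustration of the switching-case setting in the linear-reward, exact-oracle special case, below is a minimal Python sketch of a sliding-window CUCB-style round loop. This is a generic sketch, not the paper's algorithm; the helpers `oracle` and `pull`, the window size, and the constant `alpha` are assumptions introduced here for illustration.

```python
import numpy as np

def sliding_window_cucb(m, T, window, oracle, pull, alpha=1.5):
    # Sketch of a sliding-window CUCB round loop for a non-stationary
    # combinatorial semi-bandit with a linear reward and an exact oracle
    # (the paper's special case). `oracle(ucb)` is assumed to return a
    # super-arm (iterable of base-arm indices) maximizing the linear
    # reward under the optimistic estimates; `pull(S)` is assumed to
    # return semi-bandit feedback {i: outcome in [0, 1]} for every i in S.
    history = []  # (round, base arm, observed outcome) inside the window
    for t in range(1, T + 1):
        # Discard observations older than `window` rounds, so the
        # estimates track the current distribution after a switch.
        history = [(s, i, x) for (s, i, x) in history if s > t - window]
        counts = np.zeros(m)
        sums = np.zeros(m)
        for _, i, x in history:
            counts[i] += 1.0
            sums[i] += x
        means = sums / np.maximum(counts, 1.0)
        # Optimistic estimates; arms unseen within the window get +inf
        # so the oracle is forced to re-explore them.
        bonus = np.sqrt(alpha * np.log(max(min(t, window), 2))
                        / np.maximum(counts, 1.0))
        ucb = np.where(counts > 0, np.minimum(means + bonus, 1.0), np.inf)
        S = oracle(ucb)             # exact offline oracle on the UCB vector
        for i, x in pull(S).items():
            history.append((t, i, x))
```

In such sliding-window schemes the window is typically tuned on the order of $\sqrt{T/N}$ up to logarithmic factors, which is exactly where advance knowledge of $N$ enters; the parameter-free result for this special case removes that dependence.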