Combinatorial semi-bandit in the non-stationary environment

Wei Chen, Liwei Wang, Haoyu Zhao, Kai Zheng
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:865-875, 2021.

Abstract

In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only an approximate offline oracle exists (Wang and Chen, NIPS 2017), our algorithm achieves $\tilde{O}(m\sqrt{N T}/\Delta_{\min})$ distribution-dependent regret in the switching case and $\tilde{O}(V^{1/3}T^{2/3})$ distribution-independent regret in the dynamic case, where $N$ is the number of switches, $V$ is the total amount of "distribution change", $m$ is the total number of arms, and $\Delta_{\min}$ is a gap variable that depends on the distributions of arm outcomes. The regret bounds in both scenarios are nearly optimal, but the algorithm needs to know the parameter $N$ or $V$ in advance. We further show that, by employing another technique, the algorithm no longer needs to know $N$ or $V$, although the regret bounds may become suboptimal. In the special case where the reward function is linear and an exact oracle is available, we apply a new technique to design a parameter-free algorithm that achieves nearly optimal regret in both the switching case and the dynamic case.
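To make the switching setting concrete, below is a minimal sliding-window CUCB-style sketch for the special case the abstract mentions (linear reward with an exact oracle, here a trivial top-$K$ selection). This illustrates the general sliding-window idea for non-stationary semi-bandits and is not the paper's algorithm; the window length `window`, the environment callback `draw_outcomes`, and all other names are assumptions introduced for illustration.

```python
import numpy as np
from collections import deque

def sliding_window_cucb(T, m, K, draw_outcomes, window=1000):
    """Sliding-window CUCB-style sketch (illustrative, not the paper's
    algorithm). Each round plays the K arms with the largest UCB
    indices -- a trivial exact oracle for linear top-K reward -- and
    observes the outcomes of the played arms only (semi-bandit
    feedback). Statistics are kept over the last `window` rounds, so
    the estimates can track a switching environment."""
    history = deque()       # (round, arm, outcome) triples inside the window
    counts = np.zeros(m)    # plays of each arm within the window
    sums = np.zeros(m)      # outcome sums within the window
    total_reward = 0.0
    for t in range(1, T + 1):
        # Evict observations that have fallen out of the window.
        while history and history[0][0] <= t - window:
            _, a, x = history.popleft()
            counts[a] -= 1
            sums[a] -= x
        # UCB index: windowed mean plus exploration bonus;
        # arms unseen within the window get +inf so they are explored.
        safe = np.maximum(counts, 1)
        means = np.where(counts > 0, sums / safe, np.inf)
        bonus = np.where(counts > 0,
                         np.sqrt(3 * np.log(min(t, window) + 1) / (2 * safe)),
                         np.inf)
        action = np.argsort(means + bonus)[-K:]  # exact top-K oracle
        outcomes = draw_outcomes(t, action)      # one outcome per played arm
        for a, x in zip(action, outcomes):
            history.append((t, a, x))
            counts[a] += 1
            sums[a] += x
        total_reward += float(np.sum(outcomes))
    return total_reward

# Toy usage: 10 Bernoulli arms whose means flip once halfway through.
rng = np.random.default_rng(0)
base = np.linspace(0.1, 0.9, 10)

def draw_outcomes(t, arms):
    means = base if t <= 5000 else base[::-1]
    return rng.binomial(1, means[arms])

print(sliding_window_cucb(T=10_000, m=10, K=3, draw_outcomes=draw_outcomes))
```

The window length trades off tracking speed against estimation noise (roughly, a window on the order of $\sqrt{T/N}$ rounds for $N$ switches), which echoes why the nearly optimal bounds in the abstract require knowing $N$ or $V$ in advance.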

Cite this Paper


BibTeX
@InProceedings{pmlr-v161-chen21a,
  title     = {Combinatorial semi-bandit in the non-stationary environment},
  author    = {Chen, Wei and Wang, Liwei and Zhao, Haoyu and Zheng, Kai},
  booktitle = {Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence},
  pages     = {865--875},
  year      = {2021},
  editor    = {de Campos, Cassio and Maathuis, Marloes H.},
  volume    = {161},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v161/chen21a/chen21a.pdf},
  url       = {https://proceedings.mlr.press/v161/chen21a.html}
}
APA
Chen, W., Wang, L., Zhao, H. & Zheng, K. (2021). Combinatorial semi-bandit in the non-stationary environment. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 161:865-875. Available from https://proceedings.mlr.press/v161/chen21a.html.
