Communication-Constrained Bandits under Additive Gaussian Noise

Prathamesh Mayekar, Jonathan Scarlett, Vincent Y. F. Tan
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:24236-24250, 2023.

Abstract

We study a distributed stochastic multi-armed bandit problem in which a client supplies the learner with communication-constrained feedback based on the rewards of the corresponding arm pulls. In our setup, the client must encode the rewards such that the second moment of the encoded rewards is no more than $P$, and this encoded reward is further corrupted by additive Gaussian noise of variance $\sigma^2$; the learner only has access to this corrupted reward. For this setting, we derive an information-theoretic lower bound of $\Omega\left(\sqrt{\frac{KT}{\mathtt{SNR} \wedge 1}}\right)$ on the minimax regret of any scheme, where $\mathtt{SNR} \coloneqq \frac{P}{\sigma^2}$, and $K$ and $T$ are the number of arms and the time horizon, respectively. Furthermore, we propose a multi-phase bandit algorithm, $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$, which matches this lower bound up to a minor additive factor. $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$ performs uniform exploration in its initial phases and then uses the upper confidence bound (UCB) bandit algorithm in its final phase. An interesting feature of $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$ is that the coarse estimates of the mean rewards formed during a uniform exploration phase help to refine the encoding protocol in the next phase, leading to more accurate mean-reward estimates in the subsequent phase. This positive reinforcement cycle is critical to reducing the number of uniform exploration rounds and closely matching our lower bound.
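
To make the algorithmic structure described above concrete, the following Python sketch illustrates a multi-phase scheme in the spirit of $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$: uniform-exploration phases whose coarse mean estimates are fed back as centering points for the encoder in the next phase, followed by a final UCB phase. The encoder (center-and-clip to meet the second-moment budget $P$), the phase lengths, and the unit-variance raw-reward noise are illustrative assumptions, not the paper's exact protocol or analysis.

# Illustrative sketch only (assumptions noted above); not the authors' exact UE-UCB++ protocol.
import numpy as np

rng = np.random.default_rng(0)


def encode(reward, center, P):
    # Center the reward at the learner's current estimate and clip so that the
    # transmitted value satisfies the second-moment budget E[Y^2] <= P.
    # (Center-and-clip is an illustrative encoder choice.)
    return float(np.clip(reward - center, -np.sqrt(P), np.sqrt(P)))


def ue_ucb_sketch(means, T, P=1.0, sigma=1.0, num_ue_phases=2):
    K = len(means)
    centers = np.zeros(K)   # encoding centers, refined after each exploration phase
    est = np.zeros(K)       # mean-reward estimates built from decoded feedback
    counts = np.zeros(K)
    t = 0

    def pull(arm):
        nonlocal t
        reward = means[arm] + rng.normal()                          # raw reward at the client
        channel_out = encode(reward, centers[arm], P) + sigma * rng.normal()  # Gaussian channel
        decoded = channel_out + centers[arm]                        # learner re-adds the known center
        counts[arm] += 1
        est[arm] += (decoded - est[arm]) / counts[arm]
        t += 1

    # Uniform-exploration phases: the coarse estimates of one phase become the
    # encoding centers of the next (the "positive reinforcement cycle").
    phase_len = max(1, int(np.sqrt(T) / K))   # illustrative phase length
    for _ in range(num_ue_phases):
        for arm in range(K):
            for _ in range(phase_len):
                if t >= T:
                    return est
                pull(arm)
        centers = est.copy()

    # Final phase: standard UCB on the decoded observations.
    while t < T:
        ucb = est + np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))
        pull(int(np.argmax(ucb)))
    return est


if __name__ == "__main__":
    print(ue_ucb_sketch(np.array([0.2, 0.5, 0.9]), T=5000))

In this sketch, better centering from the previous phase shrinks the magnitude of the residual that must fit within the budget $P$, so less is lost to clipping and the next phase's estimates are more accurate, mirroring the reinforcement cycle described in the abstract.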

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-mayekar23a,
  title     = {Communication-Constrained Bandits under Additive {G}aussian Noise},
  author    = {Mayekar, Prathamesh and Scarlett, Jonathan and Tan, Vincent Y. F.},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {24236--24250},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/mayekar23a/mayekar23a.pdf},
  url       = {https://proceedings.mlr.press/v202/mayekar23a.html},
  abstract  = {We study a distributed stochastic multi-armed bandit where a client supplies the learner with communication-constrained feedback based on the rewards for the corresponding arm pulls. In our setup, the client must encode the rewards such that the second moment of the encoded rewards is no more than $P$, and this encoded reward is further corrupted by additive Gaussian noise of variance $\sigma^2$; the learner only has access to this corrupted reward. For this setting, we derive an information-theoretic lower bound of $\Omega\left(\sqrt{\frac{KT}{\mathtt{SNR} \wedge1}} \right)$ on the minimax regret of any scheme, where $\mathtt{SNR}\coloneqq \frac{P}{\sigma^2}$, and $K$ and $T$ are the number of arms and time horizon, respectively. Furthermore, we propose a multi-phase bandit algorithm, $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$, which matches this lower bound to a minor additive factor. $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$ performs uniform exploration in its initial phases and then utilizes the upper confidence bound (UCB) bandit algorithm in its final phase. An interesting feature of $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$ is that the coarser estimates of the mean rewards formed during a uniform exploration phase help to refine the encoding protocol in the next phase, leading to more accurate mean estimates of the rewards in the subsequent phase. This positive reinforcement cycle is critical to reducing the number of uniform exploration rounds and closely matching our lower bound.}
}
Endnote
%0 Conference Paper
%T Communication-Constrained Bandits under Additive Gaussian Noise
%A Prathamesh Mayekar
%A Jonathan Scarlett
%A Vincent Y. F. Tan
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-mayekar23a
%I PMLR
%P 24236--24250
%U https://proceedings.mlr.press/v202/mayekar23a.html
%V 202
%X We study a distributed stochastic multi-armed bandit where a client supplies the learner with communication-constrained feedback based on the rewards for the corresponding arm pulls. In our setup, the client must encode the rewards such that the second moment of the encoded rewards is no more than $P$, and this encoded reward is further corrupted by additive Gaussian noise of variance $\sigma^2$; the learner only has access to this corrupted reward. For this setting, we derive an information-theoretic lower bound of $\Omega\left(\sqrt{\frac{KT}{\mathtt{SNR} \wedge1}} \right)$ on the minimax regret of any scheme, where $\mathtt{SNR}\coloneqq \frac{P}{\sigma^2}$, and $K$ and $T$ are the number of arms and time horizon, respectively. Furthermore, we propose a multi-phase bandit algorithm, $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$, which matches this lower bound to a minor additive factor. $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$ performs uniform exploration in its initial phases and then utilizes the upper confidence bound (UCB) bandit algorithm in its final phase. An interesting feature of $\mathtt{UE}\text{-}\mathtt{UCB}\text{++}$ is that the coarser estimates of the mean rewards formed during a uniform exploration phase help to refine the encoding protocol in the next phase, leading to more accurate mean estimates of the rewards in the subsequent phase. This positive reinforcement cycle is critical to reducing the number of uniform exploration rounds and closely matching our lower bound.
APA
Mayekar, P., Scarlett, J. & Tan, V.Y.F. (2023). Communication-Constrained Bandits under Additive Gaussian Noise. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:24236-24250. Available from https://proceedings.mlr.press/v202/mayekar23a.html.
