Bandit Online Learning with Unknown Delays

Bingcong Li, Tianyi Chen, Georgios B. Giannakis
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:993-1002, 2019.

Abstract

This paper deals with bandit online learning, where feedback of unknown delay can emerge in non-stochastic multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function to become available through feedback, and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with unknown delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. Delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations are developed for MAB and BCO with unknown delays, respectively. Based on a unifying analysis framework, it is established that both DEXP3 and DBGD guarantee an $\tilde{\cal O}\big( \sqrt{K(T+D)} \big)$ regret, where $D$ denotes the delay accumulated over $T$ slots, and $K$ represents the number of arms in MAB or the dimension of decision variables in BCO. Numerical tests using both synthetic and real data validate DEXP3 and DBGD.

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-li19d, title = {Bandit Online Learning with Unknown Delays}, author = {Li, Bingcong and Chen, Tianyi and Giannakis, Georgios B.}, booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics}, pages = {993--1002}, year = {2019}, editor = {Chaudhuri, Kamalika and Sugiyama, Masashi}, volume = {89}, series = {Proceedings of Machine Learning Research}, month = {16--18 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v89/li19d/li19d.pdf}, url = {https://proceedings.mlr.press/v89/li19d.html}, abstract = {This paper deals with bandit online learning, where feedback of unknown delay can emerge in non-stochastic multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function to become available through feedback, and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with unknown delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. Delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations are developed for MAB and BCO with unknown delays, respectively. Based on a unifying analysis framework, it is established that both DEXP3 and DBGD guarantee an $\tilde{\cal O}\big( \sqrt{K(T+D)} \big)$ regret, where $D$ denotes the delay accumulated over $T$ slots, and $K$ represents the number of arms in MAB or the dimension of decision variables in BCO. Numerical tests using both synthetic and real data validate DEXP3 and DBGD.} }
Endnote
%0 Conference Paper %T Bandit Online Learning with Unknown Delays %A Bingcong Li %A Tianyi Chen %A Georgios B. Giannakis %B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2019 %E Kamalika Chaudhuri %E Masashi Sugiyama %F pmlr-v89-li19d %I PMLR %P 993--1002 %U https://proceedings.mlr.press/v89/li19d.html %V 89 %X This paper deals with bandit online learning, where feedback of unknown delay can emerge in non-stochastic multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function to become available through feedback, and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with unknown delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. Delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations are developed for MAB and BCO with unknown delays, respectively. Based on a unifying analysis framework, it is established that both DEXP3 and DBGD guarantee an $\tilde{\cal O}\big( \sqrt{K(T+D)} \big)$ regret, where $D$ denotes the delay accumulated over $T$ slots, and $K$ represents the number of arms in MAB or the dimension of decision variables in BCO. Numerical tests using both synthetic and real data validate DEXP3 and DBGD.
APA
Li, B., Chen, T. & Giannakis, G.B.. (2019). Bandit Online Learning with Unknown Delays. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:993-1002 Available from https://proceedings.mlr.press/v89/li19d.html.

Related Material