Delay and Cooperation in Nonstochastic Bandits

Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour, Alberto Minora
29th Annual Conference on Learning Theory, PMLR 49:605-622, 2016.

Abstract

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order \sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}, where \alpha_{\le d} is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d=\sqrt{K} the regret bound is K^{1/4}\sqrt{T}, strictly better than the minimax regret \sqrt{KT} for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret \sqrt{T\ln K} when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
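As an illustrative sketch (not code from the paper), the quantities in the bound can be evaluated numerically for a toy network: brute-force the independence number \alpha_{\le d} of the d-th power of a small communication graph, then plug it into \sqrt{(d+1 + (K/N)\alpha_{\le d}) T\ln K}. The path graph and the values of K, N, T, d below are made-up assumptions for illustration only.

```python
# Illustrative sketch: evaluate the Exp3-Coop regret bound's order
# sqrt((d+1 + (K/N) * alpha_{<=d}) * T * ln K) on a toy graph.
# Graph and parameter values are hypothetical, not from the paper.
from itertools import combinations
from math import log, sqrt

def power_graph(adj, d):
    """Edge set of G^d: connect u, v whose distance in G is at most d (BFS)."""
    edges = set()
    for s in adj:
        dist = {s: 0}
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        for v, dv in dist.items():
            if 0 < dv <= d:
                edges.add((min(s, v), max(s, v)))
    return edges

def independence_number(n, edges):
    """Largest vertex subset with no internal edge (brute force, small n only)."""
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all((min(u, v), max(u, v)) not in edges
                   for u, v in combinations(subset, 2)):
                return size
    return 0

# Path graph on N = 6 agents: 0-1-2-3-4-5
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
N, K, T, d = 6, 4, 10_000, 2
alpha_le_d = independence_number(N, power_graph(adj, d))  # alpha of G^2
bound = sqrt((d + 1 + (K / N) * alpha_le_d) * T * log(K))
print(alpha_le_d, bound)
```

For the path on six vertices, squaring the graph (d=2) connects every pair within distance two, so the independence number drops from 3 (for G itself) to 2, which shrinks the K/N term of the bound as the abstract describes.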

Cite this Paper


BibTeX
@InProceedings{pmlr-v49-cesa-bianchi16,
  title     = {Delay and Cooperation in Nonstochastic Bandits},
  author    = {Nicol{\`o} Cesa-Bianchi and Claudio Gentile and Yishay Mansour and Alberto Minora},
  booktitle = {29th Annual Conference on Learning Theory},
  pages     = {605--622},
  year      = {2016},
  editor    = {Vitaly Feldman and Alexander Rakhlin and Ohad Shamir},
  volume    = {49},
  series    = {Proceedings of Machine Learning Research},
  address   = {Columbia University, New York, New York, USA},
  month     = {23--26 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v49/cesa-bianchi16.pdf},
  url       = {http://proceedings.mlr.press/v49/cesa-bianchi16.html},
  abstract  = {We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order \sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}, where \alpha_{\le d} is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d=\sqrt{K} the regret bound is K^{1/4}\sqrt{T}, strictly better than the minimax regret \sqrt{KT} for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret \sqrt{T\ln K} when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.}
}
Endnote
%0 Conference Paper
%T Delay and Cooperation in Nonstochastic Bandits
%A Nicolò Cesa-Bianchi
%A Claudio Gentile
%A Yishay Mansour
%A Alberto Minora
%B 29th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2016
%E Vitaly Feldman
%E Alexander Rakhlin
%E Ohad Shamir
%F pmlr-v49-cesa-bianchi16
%I PMLR
%J Proceedings of Machine Learning Research
%P 605--622
%U http://proceedings.mlr.press
%V 49
%W PMLR
%X We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order \sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}, where \alpha_{\le d} is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d=\sqrt{K} the regret bound is K^{1/4}\sqrt{T}, strictly better than the minimax regret \sqrt{KT} for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret \sqrt{T\ln K} when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
RIS
TY - CPAPER
TI - Delay and Cooperation in Nonstochastic Bandits
AU - Nicolò Cesa-Bianchi
AU - Claudio Gentile
AU - Yishay Mansour
AU - Alberto Minora
BT - 29th Annual Conference on Learning Theory
PY - 2016/06/06
DA - 2016/06/06
ED - Vitaly Feldman
ED - Alexander Rakhlin
ED - Ohad Shamir
ID - pmlr-v49-cesa-bianchi16
PB - PMLR
SP - 605
DP - PMLR
EP - 622
L1 - http://proceedings.mlr.press/v49/cesa-bianchi16.pdf
UR - http://proceedings.mlr.press/v49/cesa-bianchi16.html
AB - We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm, and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order \sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}, where \alpha_{\le d} is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for d=\sqrt{K} the regret bound is K^{1/4}\sqrt{T}, strictly better than the minimax regret \sqrt{KT} for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret \sqrt{T\ln K} when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
ER -
APA
Cesa-Bianchi, N., Gentile, C., Mansour, Y. & Minora, A. (2016). Delay and Cooperation in Nonstochastic Bandits. 29th Annual Conference on Learning Theory, in PMLR 49:605-622
