Delay and Cooperation in Nonstochastic Bandits

Nicolò Cesa-Bianchi; Claudio Gentile; Yishay Mansour; Alberto Minora

29th Annual Conference on Learning Theory, PMLR 49:605-622, 2016.

Abstract

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order $\sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
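To make the quantities in the bound concrete, the following is a minimal Python sketch, not the authors' code, that evaluates the order of the stated regret bound on a small graph. The helper names (`within_d_hops`, `independence_number`, `coop_regret_order`) and the brute-force search are assumptions chosen for illustration; constants hidden by "of order" are dropped, and the search is only practical for a handful of nodes.

```python
import math
from collections import deque
from itertools import combinations


def within_d_hops(adj, d):
    """Adjacency of the d-th power graph: u ~ v iff 1 <= dist(u, v) <= d."""
    power = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == d:
                continue  # do not expand beyond d hops
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        power[source] = set(dist) - {source}
    return power


def independence_number(adj):
    """Size of a largest independent set, by exhaustive search (tiny graphs only)."""
    nodes = list(adj)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0


def coop_regret_order(adj, d, K, T):
    """Evaluate sqrt((d + 1 + (K / N) * alpha_{<=d}) * T * ln K), constants dropped."""
    N = len(adj)
    alpha = independence_number(within_d_hops(adj, d))
    return math.sqrt((d + 1 + K / N * alpha) * T * math.log(K))


# Hypothetical example: six agents arranged in a ring.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
for d in range(4):
    alpha = independence_number(within_d_hops(ring, d))
    print(d, alpha, coop_regret_order(ring, d, K=16, T=1000))
```

On this ring, larger d shrinks the independence number of the power graph while increasing the d+1 term, which is the trade-off the abstract's choice of d addresses.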

Cite this Paper

BibTeX


@InProceedings{pmlr-v49-cesa-bianchi16,
  title = 	 {Delay and Cooperation in Nonstochastic Bandits},
  author = 	 {Cesa-Bianchi, Nicol{\`o} and Gentile, Claudio and Mansour, Yishay and Minora, Alberto},
  booktitle = 	 {29th Annual Conference on Learning Theory},
  pages = 	 {605--622},
  year = 	 {2016},
  editor = 	 {Feldman, Vitaly and Rakhlin, Alexander and Shamir, Ohad},
  volume = 	 {49},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Columbia University, New York, New York, USA},
  month = 	 {23--26 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v49/cesa-bianchi16.pdf},
  url = 	 {https://proceedings.mlr.press/v49/cesa-bianchi16.html},
  abstract = 	 {We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order $\sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.}
}

Endnote

%0 Conference Paper
%T Delay and Cooperation in Nonstochastic Bandits
%A Nicolò Cesa-Bianchi
%A Claudio Gentile
%A Yishay Mansour
%A Alberto Minora
%B 29th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2016
%E Vitaly Feldman
%E Alexander Rakhlin
%E Ohad Shamir
%F pmlr-v49-cesa-bianchi16
%I PMLR
%P 605--622
%U https://proceedings.mlr.press/v49/cesa-bianchi16.html
%V 49
%X We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order $\sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.

RIS


TY  - CPAPER
TI  - Delay and Cooperation in Nonstochastic Bandits
AU  - Nicolò Cesa-Bianchi
AU  - Claudio Gentile
AU  - Yishay Mansour
AU  - Alberto Minora
BT  - 29th Annual Conference on Learning Theory
DA  - 2016/06/06
ED  - Vitaly Feldman
ED  - Alexander Rakhlin
ED  - Ohad Shamir
ID  - pmlr-v49-cesa-bianchi16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 49
SP  - 605
EP  - 622
L1  - http://proceedings.mlr.press/v49/cesa-bianchi16.pdf
UR  - https://proceedings.mlr.press/v49/cesa-bianchi16.html
AB  - We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than d hops to arrive, where d is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with K actions and N agents the average per-agent regret after T rounds is at most of order $\sqrt{\left(d+1 + \frac{K}{N}\alpha_{\le d}\right)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the d-th power of the communication graph G. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of d lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when G is dense. When G has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in G, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
ER  -

APA


Cesa-Bianchi, N., Gentile, C., Mansour, Y. & Minora, A. (2016). Delay and Cooperation in Nonstochastic Bandits. 29th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 49:605-622. Available from https://proceedings.mlr.press/v49/cesa-bianchi16.html.
