[edit]

# Multiplayer Bandit Learning, from Competition to Cooperation

*Proceedings of Thirty Fourth Conference on Learning Theory*, PMLR 134:679-723, 2021.

#### Abstract

The stochastic multi-armed bandit model captures the tradeoff between exploration and exploitation. We study the effects of competition and cooperation on this tradeoff. Suppose there are two arms, one predictable and one risky, and two players, Alice and Bob. In every round, each player pulls an arm, receives the resulting reward, and observes the choice of the other player but not their reward. Alice’s utility is $\Gamma_A + \lambda \Gamma_B$ (and similarly for Bob), where $\Gamma_A$ is Alice’s total reward and $\lambda \in [-1, 1]$ is a cooperation parameter. At $\lambda = -1$ the players are competing in a zero-sum game, at $\lambda = 1$, their interests are aligned, and at $\lambda = 0$, they are neutral: each player’s utility is their own reward. The model is related to the economics literature on strategic experimentation, where usually players observe each other’s rewards. Suppose the predictable arm has success probability $p$ and the risky arm has prior $\mu$. If the discount factor is $\beta$, then the value of $p$ where a single player is indifferent between the arms is the Gittins index $g = g(\mu,\beta) > m$, where $m$ is the mean of the risky arm. Our first result answers, in this setting, a fundamental question posed by Rothschild \cite{rotschild}. We show that competing and neutral players eventually settle on the same arm (even though it may not be the best arm) in every Nash equilibrium, while this can fail for players with aligned interests. Moreover, we show that \emph{competing players} explore \emph{less} than a single player: there is $p^* \in (m, g)$ so that for all $p > p^*$, the players stay at the predictable arm. However, the players are not myopic: they still explore for some $p > m$. On the other hand, \emph{cooperating players} (with $\lambda =1$) explore \emph{more} than a single player. We also show that \emph{neutral players} learn from each other, receiving strictly higher total rewards than they would playing alone, for all $ p\in (p^*, g)$, where $p^*$ is the threshold above which competing players do not explore.