Coordination without communication: optimal regret in two players multi-armed bandits
Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:916-939, 2020.
Abstract
We consider two agents simultaneously playing the same stochastic three-armed bandit problem. The two agents cooperate but cannot communicate. Under the assumption that shared randomness is available, we propose a strategy with no collisions at all between the players (with very high probability) and with near-optimal regret O(√(T log T)). We also argue that the extra logarithmic factor √(log T) should be necessary, by proving a lower bound for a full-information variant of the problem.
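The role of shared randomness in avoiding collisions can be illustrated with a toy sketch (this is not the paper's actual strategy, and the function names are hypothetical): if both players seed a common random source with the round index, they each derive the same permutation of the arms and play distinct entries of it, so their choices never coincide.

```python
import random

def shared_permutation(seed, t, n_arms=3):
    # Both players derive the identical permutation from the shared seed
    # and the round index t, without exchanging any messages.
    rng = random.Random(f"{seed}:{t}")
    perm = list(range(n_arms))
    rng.shuffle(perm)
    return perm

def play_round(seed, t, player_id):
    # Player i deterministically takes the i-th entry of the common
    # permutation; distinct entries of a permutation cannot collide.
    return shared_permutation(seed, t)[player_id]

# Simulate T rounds and count collisions between the two players.
seed, T = 12345, 1000
collisions = sum(play_round(seed, t, 0) == play_round(seed, t, 1)
                 for t in range(T))
print(collisions)  # 0: the two players always pick different arms
```

In the paper's setting the difficulty is combining this kind of collision-free coordination with learning which arms are good; the sketch above only shows the coordination half.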