Thompson sampling for Markov games with piecewise stationary opponent policies
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:738-748, 2021.
Reinforcement learning problems with multiple agents pose the challenge of efficiently adapting to nonstationary dynamics arising from other agents’ strategic behavior. Although several algorithms exist for these problems with promising empirical results, regret analysis and efficient use of other-agent models in general-sum games is very limited. We propose an algorithm (TSMG) for general-sum Markov games against agents that switch between several stationary policies, combining change detection with Thompson sampling to learn parametric models of these policies. Under standard assumptions for parametric Markov decision process (MDP) learning, the expected regret of TSMG in the worst case over policy parameters and switch schedules is near-optimal in time and number of switches, up to logarithmic factors. Our experiments on simulated games show that TSMG can outperform standard Thompson sampling and a version of Thompson sampling with a static reset schedule, despite the violation of an assumption that the MDPs induced by the other player are ergodic.