[edit]
Regret Minimization with Adaptive Opponents in Repeated Games
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4857-4858, 2026.
Abstract
In this paper, we study regret minimization in repeated games with \emph{adaptive} opponents whose strategies may depend on the histories of play. The classical online learning metric of \emph{external regret} does not fully capture such adaptivity, since it compares against decisions while treating the loss sequence as fixed. To account for the counterfactual reasoning of the players, we introduce a new metric, \texttt{Repeated Policy Regret (RP-Regret)}, specific to this game-theoretic setting, which measures the difference between the \emph{realized} and the \emph{best-in-hindsight} accumulated utility when all players can \emph{respond} to the history of play. Compared with existing regret notions in adaptive environments, \texttt{RP-Regret} allows stronger dynamic comparators and less restricted opponents, while still enabling the learning of better equilibria when all players minimize it. We first identify necessary conditions for achieving sublinear \texttt{RP-Regret}. The comparator strategies must have sublinear accumulated variation, and both the comparator and the opponents must have imperfect recall. Without these conditions, sublinear \texttt{RP-Regret} is impossible to achieve in general. We then provide additional sufficient conditions and algorithms for minimizing \texttt{RP-Regret}. A key challenge is that \texttt{RP-Regret} is \emph{nonconvex} in the strategy space by definition. We address this challenge through three approaches. The first approach uses a nonconvex optimization oracle, as in prior work on online nonconvex learning. The second approach minimizes a convex \emph{linearized} surrogate at each iteration, which yields the minimization of a local variant of \texttt{RP-Regret}. The third approach directly minimizes \texttt{RP-Regret} when the opponents change their strategies slowly, by reformulating the repeated game as a Markov game and optimizing over occupancy measures. Finally, we show that when all players can run algorithms to minimize the \texttt{RP-Regret} or its linearized variant, certain subgame-perfect equilibria of the repeated game can be learned. We also provide experiments to show that these regret notions can lead to more cooperative outcomes with higher utility in games such as the Stag Hunt.