Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Junyan Liu; Haipeng Luo; Zihan Zhang; Lillian J. Ratliff

Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Junyan Liu, Haipeng Luo, Zihan Zhang, Lillian J. Ratliff

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4819-4856, 2026.

Abstract

We study online learning in two-player uninformed Markov games, where the opponent’s actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $\widetilde{O}(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $\widetilde{O}(\sqrt{K})$ external regret is well-known to be achievable, their result is still the \textit{worse} rate $\widetilde{O}(K^{2/3})$ on a \textit{weaker} metric. In this work, we fully address both limitations. First, we introduce \textit{empirical Nash-value regret}, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $\widetilde{O} \big(\min{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}}\big)$ regret bound, where $C$ quantifies the “variance” of the opponent’s policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes—$\widetilde{O}(\sqrt{K})$ external regret when the opponent is fixed and $\widetilde{O}(K^{2/3})$ Nash-value regret in the worst case—but also smoothly interpolate between these extremes by automatically adapting to the opponent’s non-stationarity. We achieve so by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $\widetilde{O}(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-liu26b,
  title = 	 {Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation},
  author =       {Liu, Junyan and Luo, Haipeng and Zhang, Zihan and Ratliff, Lillian J.},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {4819--4856},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/liu26b/liu26b.pdf},
  url = 	 {https://proceedings.mlr.press/v336/liu26b.html},
  abstract = 	 {We study online learning in two-player uninformed Markov games, where the opponent’s actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$.  They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $\widetilde{O}(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $\widetilde{O}(\sqrt{K})$ external regret is well-known to be achievable, their result is still the \textit{worse} rate $\widetilde{O}(K^{2/3})$ on a \textit{weaker} metric. In this work, we fully address both limitations. First, we introduce \textit{empirical Nash-value regret},  a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $\widetilde{O} \big(\min{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}}\big)$ regret bound, where $C$ quantifies the “variance” of the opponent’s policies and $L$ denotes the number of policy switches (both at most $O(K)$).  Therefore, our results not only recover the two extremes—$\widetilde{O}(\sqrt{K})$ external regret when the opponent is fixed and $\widetilde{O}(K^{2/3})$ Nash-value regret in the worst case—but also smoothly interpolate between these extremes by automatically adapting to the opponent’s non-stationarity. We achieve so by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $\widetilde{O}(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.}
}

Endnote

%0 Conference Paper
%T Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation
%A Junyan Liu
%A Haipeng Luo
%A Zihan Zhang
%A Lillian J. Ratliff
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-liu26b
%I PMLR
%P 4819--4856
%U https://proceedings.mlr.press/v336/liu26b.html
%V 336
%X We study online learning in two-player uninformed Markov games, where the opponent’s actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$.  They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $\widetilde{O}(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $\widetilde{O}(\sqrt{K})$ external regret is well-known to be achievable, their result is still the \textit{worse} rate $\widetilde{O}(K^{2/3})$ on a \textit{weaker} metric. In this work, we fully address both limitations. First, we introduce \textit{empirical Nash-value regret},  a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $\widetilde{O} \big(\min{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}}\big)$ regret bound, where $C$ quantifies the “variance” of the opponent’s policies and $L$ denotes the number of policy switches (both at most $O(K)$).  Therefore, our results not only recover the two extremes—$\widetilde{O}(\sqrt{K})$ external regret when the opponent is fixed and $\widetilde{O}(K^{2/3})$ Nash-value regret in the worst case—but also smoothly interpolate between these extremes by automatically adapting to the opponent’s non-stationarity. We achieve so by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $\widetilde{O}(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.

APA

Liu, J., Luo, H., Zhang, Z. & Ratliff, L.J.. (2026). Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4819-4856 Available from https://proceedings.mlr.press/v336/liu26b.html.

Related Material

Download PDF