Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning

Ian Gemp, Andreas Alexander Haupt, Luke Marris, Siqi Liu, Georgios Piliouras
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:19137-19160, 2025.

Abstract

Behavioral diversity, expert imitation, fairness, safety, and other goals give rise to preferences in sequential decision-making domains that do not decompose additively across time. We introduce the class of convex Markov games, which allows general convex preferences over occupancy measures. Despite the infinite time horizon and strictly greater generality than Markov games, pure-strategy Nash equilibria exist. Furthermore, equilibria can be approximated empirically by performing gradient descent on an upper bound of exploitability. Our experiments reveal novel solutions to classic repeated normal-form games, find fair solutions in a repeated asymmetric coordination game, and prioritize safe long-term behavior in a robot warehouse environment. In the prisoner’s dilemma, our algorithm leverages transient imitation to find a policy profile that deviates only slightly from observed human play, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
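
To make the abstract's objects concrete, the following is a minimal sketch in illustrative notation (the symbols u_i, f_i, rho^pi, and exploit are assumptions for exposition, not the paper's own definitions): each player evaluates a joint policy through a convex function of the occupancy measure it induces, and a policy profile is a Nash equilibrium exactly when its exploitability vanishes.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative notation only; not the paper's own definitions.
% Each player i evaluates the joint policy \pi = (\pi_1, ..., \pi_n)
% through the occupancy measure \rho^{\pi} that \pi induces.
\begin{align}
  u_i(\pi) &= -f_i\big(\rho^{\pi}\big), && f_i \text{ convex},\\
\intertext{so a standard Markov game is recovered as the linear special case
$f_i(\rho) = -\langle \rho, r_i \rangle$ for a reward vector $r_i$, which is
one way to read the claim of strict generality. Exploitability sums each
player's best-response improvement:}
  \mathrm{exploit}(\pi) &= \sum_i \Big[ \max_{\pi_i'} u_i(\pi_i', \pi_{-i}) - u_i(\pi) \Big] \;\ge\; 0.
\end{align}
% A profile \pi is a Nash equilibrium iff exploit(\pi) = 0; per the
% abstract, equilibria are approximated by gradient descent on an
% upper bound of this quantity.
\end{document}

If, as in single-agent convex MDPs, the set of occupancy measures a player can induce against fixed opponents is convex, then each best-response term above is a convex program, which is the kind of structure a gradient method on an exploitability bound can take advantage of.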

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gemp25a,
  title     = {Convex {M}arkov Games: A New Frontier for Multi-Agent Reinforcement Learning},
  author    = {Gemp, Ian and Haupt, Andreas Alexander and Marris, Luke and Liu, Siqi and Piliouras, Georgios},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {19137--19160},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gemp25a/gemp25a.pdf},
  url       = {https://proceedings.mlr.press/v267/gemp25a.html}
}
Endnote
%0 Conference Paper
%T Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning
%A Ian Gemp
%A Andreas Alexander Haupt
%A Luke Marris
%A Siqi Liu
%A Georgios Piliouras
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gemp25a
%I PMLR
%P 19137--19160
%U https://proceedings.mlr.press/v267/gemp25a.html
%V 267
APA
Gemp, I., Haupt, A.A., Marris, L., Liu, S. & Piliouras, G. (2025). Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:19137-19160. Available from https://proceedings.mlr.press/v267/gemp25a.html.