A Natural Actor-Critic Framework for Zero-Sum Markov Games

Ahmet Alacaoglu, Luca Viano, Niao He, Volkan Cevher
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:307-366, 2022.

Abstract

We introduce algorithms based on natural actor-critic and analyze their sample complexity for solving two-player zero-sum Markov games in the tabular case. Our results improve the best-known sample complexities of policy gradient/actor-critic methods for convergence to Nash equilibrium in the multi-agent setting. We use the error propagation scheme in approximate dynamic programming, recent advances for global convergence of policy gradient methods, temporal difference learning, and techniques from stochastic primal-dual optimization. Our algorithms feature two stages, requiring the agents to agree on an etiquette before starting their interactions, which is feasible for instance in self-play. However, the agents only access the joint reward and the joint next state, not each other’s actions or policies. Our complexity results match the best-known results for global convergence of policy gradient algorithms for single-agent RL. We provide numerical verification of our methods for a two-player bandit environment and a two-player game, Alesia. We observe improved empirical performance as compared to the recently proposed optimistic gradient descent-ascent variant for Markov games.
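To give a feel for the setting in its simplest instance (a two-player bandit, i.e., a zero-sum matrix game), the sketch below runs natural-policy-gradient updates, which for tabular softmax policies reduce to multiplicative-weights updates, using only joint-reward feedback. This is a minimal illustration, not the paper's algorithm: the payoff matrix, step size, exploration level, and iteration count are arbitrary assumptions, and the paper's full method also handles Markov games via a critic trained with temporal difference learning.

import numpy as np

rng = np.random.default_rng(0)

# Zero-sum matrix game: row player maximizes x^T A y, column player minimizes it.
# Rock-paper-scissors payoffs; the matrix is an illustrative assumption, not from the paper.
A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])
n, m = A.shape

x = np.ones(n) / n               # row player's mixed strategy (softmax policy)
y = np.ones(m) / m               # column player's mixed strategy
eta, eps, T = 0.05, 0.1, 20000   # step size, exploration level, iterations (all illustrative)
x_sum, y_sum = np.zeros(n), np.zeros(m)

for _ in range(T):
    # Each player samples its own action from an exploration-mixed policy and
    # observes only the joint reward, not the opponent's action or policy.
    p = (1 - eps) * x + eps / n
    q = (1 - eps) * y + eps / m
    i = rng.choice(n, p=p)
    j = rng.choice(m, p=q)
    r = A[i, j]

    # Importance-weighted, unbiased payoff estimates built from the shared reward.
    g_x = np.zeros(n); g_x[i] = r / p[i]    # estimates (A @ q)[i]
    g_y = np.zeros(m); g_y[j] = r / q[j]    # estimates (p @ A)[j]

    # Natural policy gradient under softmax parameterization = multiplicative weights.
    x = x * np.exp(eta * g_x);  x /= x.sum()    # maximizing player ascends
    y = y * np.exp(-eta * g_y); y /= y.sum()    # minimizing player descends

    x_sum += x
    y_sum += y

x_bar, y_bar = x_sum / T, y_sum / T
# Duality gap of the averaged strategies; it should be small near a Nash equilibrium.
gap = (A @ y_bar).max() - (x_bar @ A).min()
print("approximate duality gap of averaged strategies:", gap)

The averaged strategies are reported because, in zero-sum games, it is the averaged (not the last) iterates of such multiplicative-weights dynamics that approach an equilibrium.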

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-alacaoglu22a,
  title     = {A Natural Actor-Critic Framework for Zero-Sum {M}arkov Games},
  author    = {Alacaoglu, Ahmet and Viano, Luca and He, Niao and Cevher, Volkan},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {307--366},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/alacaoglu22a/alacaoglu22a.pdf},
  url       = {https://proceedings.mlr.press/v162/alacaoglu22a.html},
  abstract  = {We introduce algorithms based on natural actor-critic and analyze their sample complexity for solving two-player zero-sum Markov games in the tabular case. Our results improve the best-known sample complexities of policy gradient/actor-critic methods for convergence to Nash equilibrium in the multi-agent setting. We use the error propagation scheme in approximate dynamic programming, recent advances for global convergence of policy gradient methods, temporal difference learning, and techniques from stochastic primal-dual optimization. Our algorithms feature two stages, requiring the agents to agree on an etiquette before starting their interactions, which is feasible for instance in self-play. However, the agents only access the joint reward and the joint next state, not each other’s actions or policies. Our complexity results match the best-known results for global convergence of policy gradient algorithms for single-agent RL. We provide numerical verification of our methods for a two-player bandit environment and a two-player game, Alesia. We observe improved empirical performance as compared to the recently proposed optimistic gradient descent-ascent variant for Markov games.}
}
Endnote
%0 Conference Paper
%T A Natural Actor-Critic Framework for Zero-Sum Markov Games
%A Ahmet Alacaoglu
%A Luca Viano
%A Niao He
%A Volkan Cevher
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-alacaoglu22a
%I PMLR
%P 307--366
%U https://proceedings.mlr.press/v162/alacaoglu22a.html
%V 162
%X We introduce algorithms based on natural actor-critic and analyze their sample complexity for solving two-player zero-sum Markov games in the tabular case. Our results improve the best-known sample complexities of policy gradient/actor-critic methods for convergence to Nash equilibrium in the multi-agent setting. We use the error propagation scheme in approximate dynamic programming, recent advances for global convergence of policy gradient methods, temporal difference learning, and techniques from stochastic primal-dual optimization. Our algorithms feature two stages, requiring the agents to agree on an etiquette before starting their interactions, which is feasible for instance in self-play. However, the agents only access the joint reward and the joint next state, not each other’s actions or policies. Our complexity results match the best-known results for global convergence of policy gradient algorithms for single-agent RL. We provide numerical verification of our methods for a two-player bandit environment and a two-player game, Alesia. We observe improved empirical performance as compared to the recently proposed optimistic gradient descent-ascent variant for Markov games.
APA
Alacaoglu, A., Viano, L., He, N. & Cevher, V. (2022). A Natural Actor-Critic Framework for Zero-Sum Markov Games. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:307-366. Available from https://proceedings.mlr.press/v162/alacaoglu22a.html.
