Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8715-8725, 2021.
Abstract
While single-agent policy optimization in a fixed environment has recently attracted considerable attention in the reinforcement learning community, much less is known theoretically when multiple agents play in a potentially competitive environment. We take a step forward by proposing and analyzing new fictitious play policy optimization algorithms for two-player zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transitions and single-controller transitions. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bounds after $T$ steps in a two-agent competitive game scenario. The regret of each player is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{T})$.
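To give a concrete feel for the combination of UCB-type optimism and fictitious play mentioned above, the following is a minimal, hypothetical sketch in the much simpler setting of a one-step zero-sum matrix game with bandit feedback; it is not the paper's algorithm for Markov games with structured transitions. Each player best responds to the opponent's empirical action frequencies (fictitious play), using UCB-style bonuses on estimated payoffs to drive exploration. The payoff matrix `A`, noise level, and bonus constant are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
# Hypothetical payoff matrix: row player receives A[i, j], column player receives -A[i, j].
A = rng.uniform(-1.0, 1.0, size=(n_actions, n_actions))

T = 5000
counts = np.ones((n_actions, n_actions))        # visit counts for each joint action (initialized at 1)
payoff_sum = np.zeros((n_actions, n_actions))   # running sum of observed payoffs per joint action
opp_counts_row = np.ones(n_actions)             # column player's empirical action counts, as seen by the row player
opp_counts_col = np.ones(n_actions)             # row player's empirical action counts, as seen by the column player

for t in range(1, T + 1):
    mean = payoff_sum / counts
    bonus = np.sqrt(2.0 * np.log(t + 1) / counts)   # UCB-type exploration bonus

    # Row player (maximizer): optimistic best response to the opponent's empirical mixture.
    q_col = opp_counts_row / opp_counts_row.sum()
    a_row = int(np.argmax((mean + bonus) @ q_col))

    # Column player (minimizer): best response to the row player's empirical mixture,
    # optimistic with respect to its own (minimization) objective.
    q_row = opp_counts_col / opp_counts_col.sum()
    a_col = int(np.argmin(q_row @ (mean - bonus)))

    # Observe a noisy payoff for the chosen joint action and update the statistics.
    r = A[a_row, a_col] + 0.1 * rng.standard_normal()
    counts[a_row, a_col] += 1
    payoff_sum[a_row, a_col] += r
    opp_counts_row[a_col] += 1
    opp_counts_col[a_row] += 1
```

In the paper's setting the same two ingredients operate over episodes of a Markov game: optimism enters through bonus-augmented value estimates built from the structured (factored independent or single-controller) transition model, and fictitious play enters through each player updating against the opponent's past policy sequence rather than a fixed opponent.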