Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games

Julien Perolat, Bruno Scherrer, Bilal Piot, Olivier Pietquin
Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:1321-1329, 2015.

Abstract

This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy which is \frac{2γ}{(1 - γ)^2}ε + \frac{1}{(1 - γ)^2}ε'-optimal, where ε is the value function approximation error and ε' is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite horizon γ-discounted two-player zero-sum stochastic games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes in a batch setting) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia.
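
For intuition only (this is not part of the paper's contribution), the sketch below shows the exact dynamic-programming backup that the approximate schemes analysed here perturb: in a zero-sum Markov game, the greedy step at each state amounts to solving a zero-sum matrix game, and iterating the resulting backup is Shapley's value iteration. The tabular arrays R[s, a, b] and P[s, a, b, s'], the function names, and the LP formulation are illustrative assumptions; they are not the authors' batch AGPI-Q algorithm.

    # Minimal sketch (illustrative, not the paper's AGPI-Q): exact value iteration
    # for a gamma-discounted zero-sum Markov game with hypothetical tabular
    # reward R[s, a, b] (to the maximizer) and transition kernel P[s, a, b, s'].
    import numpy as np
    from scipy.optimize import linprog


    def matrix_game_value(Q):
        """Value of the zero-sum matrix game Q (rows maximize, columns minimize)."""
        m, n = Q.shape
        # Variables: mixed strategy x over rows (length m) and the game value v.
        c = np.zeros(m + 1)
        c[-1] = -1.0  # maximize v  <=>  minimize -v
        # For every column j: v - sum_i Q[i, j] * x_i <= 0.
        A_ub = np.hstack([-Q.T, np.ones((n, 1))])
        b_ub = np.zeros(n)
        # The mixed strategy must sum to one.
        A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
        b_eq = np.array([1.0])
        bounds = [(0, None)] * m + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        return res.x[-1]


    def shapley_value_iteration(R, P, gamma=0.9, n_iter=200):
        """Exact value iteration for a gamma-discounted zero-sum Markov game."""
        S, A, B, _ = P.shape
        v = np.zeros(S)
        for _ in range(n_iter):
            # State-action-action values under the current value estimate.
            Q = R + gamma * P @ v          # shape (S, A, B)
            # Greedy (minimax) step: one matrix-game solve per state.
            v = np.array([matrix_game_value(Q[s]) for s in range(S)])
        return v

Approximate Value Iteration replaces the exact value update with a fitted estimate (error ε), and the approximate greedy step solves each state's matrix game only up to error ε'; the paper's bound quantifies how these two errors propagate to the final stationary policy.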

Cite this Paper


BibTeX
@InProceedings{pmlr-v37-perolat15, title = {Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games}, author = {Perolat, Julien and Scherrer, Bruno and Piot, Bilal and Pietquin, Olivier}, booktitle = {Proceedings of the 32nd International Conference on Machine Learning}, pages = {1321--1329}, year = {2015}, editor = {Bach, Francis and Blei, David}, volume = {37}, series = {Proceedings of Machine Learning Research}, address = {Lille, France}, month = {07--09 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v37/perolat15.pdf}, url = {https://proceedings.mlr.press/v37/perolat15.html}, abstract = {This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy which is $\frac{2\gamma}{(1 - \gamma)^2}\epsilon + \frac{1}{(1 - \gamma)^2}\epsilon'$-optimal, where $\epsilon$ is the value function approximation error and $\epsilon'$ is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite horizon $\gamma$-discounted two-player zero-sum stochastic games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes in a batch setting) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia.} }
Endnote
%0 Conference Paper %T Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games %A Julien Perolat %A Bruno Scherrer %A Bilal Piot %A Olivier Pietquin %B Proceedings of the 32nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2015 %E Francis Bach %E David Blei %F pmlr-v37-perolat15 %I PMLR %P 1321--1329 %U https://proceedings.mlr.press/v37/perolat15.html %V 37 %X This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy which is \frac{2γ}{(1 - γ)^2}ε + \frac{1}{(1 - γ)^2}ε'-optimal, where ε is the value function approximation error and ε' is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite horizon γ-discounted two-player zero-sum stochastic games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes in a batch setting) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia.
RIS
TY - CPAPER TI - Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games AU - Julien Perolat AU - Bruno Scherrer AU - Bilal Piot AU - Olivier Pietquin BT - Proceedings of the 32nd International Conference on Machine Learning DA - 2015/06/01 ED - Francis Bach ED - David Blei ID - pmlr-v37-perolat15 PB - PMLR DP - Proceedings of Machine Learning Research VL - 37 SP - 1321 EP - 1329 L1 - http://proceedings.mlr.press/v37/perolat15.pdf UR - https://proceedings.mlr.press/v37/perolat15.html AB - This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy which is \frac{2γ}{(1 - γ)^2}ε + \frac{1}{(1 - γ)^2}ε'-optimal, where ε is the value function approximation error and ε' is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite horizon γ-discounted two-player zero-sum stochastic games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes in a batch setting) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia. ER -
APA
Perolat, J., Scherrer, B., Piot, B. & Pietquin, O. (2015). Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:1321-1329. Available from https://proceedings.mlr.press/v37/perolat15.html.