Exploration Analysis in Finite-Horizon Turn-based Stochastic Games

Jialian Li, Yichi Zhou, Tongzheng Ren, Jun Zhu
Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR 124:201-210, 2020.

Abstract

The exploration-exploitation trade-off is one of the key concerns in reinforcement learning. Previous work on single-player Markov Decision Processes has reached near-optimal results for both PAC and high-probability regret guarantees. However, such an analysis is lacking for the more complex multi-player stochastic games, where all players aim to find an approximate Nash Equilibrium. In this work, we address the exploration issue for $N$-player finite-horizon turn-based stochastic games (FTSGs). We propose a framework, \textit{Upper Bounding the Values for Players} (UBVP), to guide exploration in FTSGs. UBVP leverages the key insight that each player chooses an optimal policy conditioned on the policies of the others, all simultaneously; players can thus explore \textit{in the face of uncertainty} and approach the Nash Equilibrium. Based on UBVP, we present two provable algorithms. One is \textit{Uniform}-PAC with a sample complexity of $\tilde{O}(1/\epsilon^2)$ to reach an $\epsilon$-Nash Equilibrium for any $\epsilon>0$; the other attains a cumulative exploitability of $\tilde{O}(\sqrt{T})$ with high probability.
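For reference, the guarantees above are stated in terms of $\epsilon$-Nash Equilibria and exploitability. Under one standard convention (the paper's formal definitions may differ in detail), for a policy profile $\pi=(\pi^1,\dots,\pi^N)$ with value $V^i(\pi)$ to player $i$,

\[ \mathrm{Expl}(\pi) = \max_{1\le i\le N}\Big(\max_{\tilde{\pi}^i} V^i(\tilde{\pi}^i,\pi^{-i}) - V^i(\pi)\Big), \qquad \pi \text{ is an } \epsilon\text{-Nash Equilibrium} \iff \mathrm{Expl}(\pi)\le\epsilon, \]

where $\pi^{-i}$ denotes the policies of all players other than $i$. The cumulative exploitability over $T$ rounds of play is then $\sum_{t=1}^{T}\mathrm{Expl}(\pi_t)$, which the second algorithm bounds by $\tilde{O}(\sqrt{T})$ with high probability.

Exploring "in the face of uncertainty" is the familiar optimism principle. Below is a minimal, hypothetical sketch of how optimistic value upper bounds can drive exploration in a finite-horizon turn-based game: each state is owned by one player, who plans with a count-based bonus on their own value, while every player's value propagates through the chosen actions. The variable names, the bonus form, and the model update are illustrative assumptions, not the paper's actual UBVP algorithms.

import numpy as np

# Hypothetical sketch of optimistic exploration in a turn-based game;
# not the paper's UBVP algorithms. Bonus form and updates are assumptions.
H, S, A, N = 4, 6, 3, 2                     # horizon, states, actions, players
rng = np.random.default_rng(0)
owner = rng.integers(N, size=S)             # which player moves in each state
P = rng.dirichlet(np.ones(S), size=(S, A))  # true transition kernel (unknown)
R = rng.uniform(size=(S, A, N))             # reward to each player (known here)

counts = np.ones((S, A))                    # visit counts (init 1 to avoid /0)
P_hat = np.full((S, A, S), 1.0 / S)         # empirical transition estimates

def optimistic_profile(t):
    """Backward induction: in each state the owning player picks the action
    maximizing an optimistic upper bound on their own value."""
    V = np.zeros((H + 1, S, N))             # one value per player
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        for s in range(S):
            i = owner[s]
            bonus = np.sqrt(np.log(t + 1.0) / counts[s])          # shape (A,)
            q_i = R[s, :, i] + P_hat[s] @ V[h + 1, :, i] + bonus  # owner's UCB
            a = int(np.argmax(q_i))
            pi[h, s] = a
            V[h, s] = R[s, a] + P_hat[s, a] @ V[h + 1]  # all players' values
    return pi

for t in range(1, 201):                     # play T = 200 episodes
    pi = optimistic_profile(t)
    s = int(rng.integers(S))
    for h in range(H):                      # roll out and update the model
        a = pi[h, s]
        s_next = int(rng.choice(S, p=P[s, a]))
        counts[s, a] += 1
        P_hat[s, a] += (np.eye(S)[s_next] - P_hat[s, a]) / counts[s, a]
        s = s_next

In this sketch, optimism enters only through the per-action bonus $\sqrt{\log t / n(s,a)}$: rarely visited actions look better to the owning player, so play is drawn toward unexplored parts of the game until the empirical model is accurate enough for the greedy profile to be near-equilibrium.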

Cite this Paper


BibTeX
@InProceedings{pmlr-v124-li20a,
  title     = {Exploration Analysis in Finite-Horizon Turn-based Stochastic Games},
  author    = {Li, Jialian and Zhou, Yichi and Ren, Tongzheng and Zhu, Jun},
  booktitle = {Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)},
  pages     = {201--210},
  year      = {2020},
  editor    = {Peters, Jonas and Sontag, David},
  volume    = {124},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--06 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v124/li20a/li20a.pdf},
  url       = {https://proceedings.mlr.press/v124/li20a.html}
}
EndNote
%0 Conference Paper
%T Exploration Analysis in Finite-Horizon Turn-based Stochastic Games
%A Jialian Li
%A Yichi Zhou
%A Tongzheng Ren
%A Jun Zhu
%B Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)
%C Proceedings of Machine Learning Research
%D 2020
%E Jonas Peters
%E David Sontag
%F pmlr-v124-li20a
%I PMLR
%P 201--210
%U https://proceedings.mlr.press/v124/li20a.html
%V 124
APA
Li, J., Zhou, Y., Ren, T. & Zhu, J. (2020). Exploration Analysis in Finite-Horizon Turn-based Stochastic Games. Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), in Proceedings of Machine Learning Research 124:201-210. Available from https://proceedings.mlr.press/v124/li20a.html.
