Provably Sample Efficient Reinforcement Learning in Competitive Linear Quadratic Systems

Jingwei Zhang, Zhuoran Yang, Zhengyuan Zhou, Zhaoran Wang
Proceedings of the 3rd Conference on Learning for Dynamics and Control, PMLR 144:597-598, 2021.

Abstract

We study infinite-horizon zero-sum linear quadratic (LQ) games, where the state transition is linear and the cost function is quadratic in the states and actions of the two players. In particular, we develop an adaptive algorithm, based on the optimism-in-the-face-of-uncertainty (OFU) principle, that properly trades off exploration and exploitation of the unknown environment in LQ games. We show that (i) the average regret of player $1$ (the min player) can be bounded by $\widetilde{\mathcal{O}}(1/\sqrt{T})$ against any fixed linear policy of the adversary (player $2$); and (ii) the average cost of player $1$ converges to the value of the game at a sublinear $\widetilde{\mathcal{O}}(1/\sqrt{T})$ rate if the adversary plays adaptively against player $1$ with the same algorithm, i.e., with self-play. To the best of our knowledge, this is the first provably sample efficient reinforcement learning algorithm for zero-sum LQ games.
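
For concreteness, a zero-sum LQ game of this type can be written in the following standard form (a generic formulation for illustration; the paper's exact notation, noise model, and assumptions may differ):

$$x_{t+1} = A x_t + B u_t + C v_t + w_t, \qquad c(x_t, u_t, v_t) = x_t^\top Q x_t + u_t^\top R^u u_t - v_t^\top R^v v_t,$$

where $x_t$ is the state, $u_t$ and $v_t$ are the actions of player $1$ (the minimizer) and player $2$ (the maximizer), $w_t$ is process noise, and $Q$, $R^u$, $R^v$ are positive (semi)definite cost matrices. Player $1$ aims to minimize, and player $2$ to maximize, the long-run average cost $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\,[c(x_t, u_t, v_t)]$; under standard conditions, the value of such a game is attained by a pair of linear feedback policies.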

Cite this Paper


BibTeX
@InProceedings{pmlr-v144-zhang21a,
  title     = {Provably Sample Efficient Reinforcement Learning in Competitive Linear Quadratic Systems},
  author    = {Zhang, Jingwei and Yang, Zhuoran and Zhou, Zhengyuan and Wang, Zhaoran},
  booktitle = {Proceedings of the 3rd Conference on Learning for Dynamics and Control},
  pages     = {597--598},
  year      = {2021},
  editor    = {Jadbabaie, Ali and Lygeros, John and Pappas, George J. and Parrilo, Pablo A. and Recht, Benjamin and Tomlin, Claire J. and Zeilinger, Melanie N.},
  volume    = {144},
  series    = {Proceedings of Machine Learning Research},
  month     = {07--08 June},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v144/zhang21a/zhang21a.pdf},
  url       = {https://proceedings.mlr.press/v144/zhang21a.html},
  abstract  = {We study the infinite-horizon zero-sum linear quadratic (LQ) games, where the state transition is linear and the cost function is quadratic in states and actions of two players. In particular, we develop an adaptive algorithm that can properly trade off between exploration and exploitation of the unknown environment in LQ games based on the optimism-in-face-of-uncertainty (OFU) principle. We show that (i) the average regret of player $1$ (the min player) can be bounded by $\widetilde{\mathcal{O}}(1/\sqrt{T})$ against any fixed linear policy of the adversary (player $2$); (ii) the average cost of player $1$ also converges to the value of the game at a sublinear $\widetilde{\mathcal{O}}(1/\sqrt{T})$ rate if the adversary plays adaptively against player $1$ with the same algorithm, i.e., with self-play. To the best of our knowledge, this is the first time that a provably sample efficient reinforcement learning algorithm is proposed for zero-sum LQ games.}
}
Endnote
%0 Conference Paper
%T Provably Sample Efficient Reinforcement Learning in Competitive Linear Quadratic Systems
%A Jingwei Zhang
%A Zhuoran Yang
%A Zhengyuan Zhou
%A Zhaoran Wang
%B Proceedings of the 3rd Conference on Learning for Dynamics and Control
%C Proceedings of Machine Learning Research
%D 2021
%E Ali Jadbabaie
%E John Lygeros
%E George J. Pappas
%E Pablo A. Parrilo
%E Benjamin Recht
%E Claire J. Tomlin
%E Melanie N. Zeilinger
%F pmlr-v144-zhang21a
%I PMLR
%P 597--598
%U https://proceedings.mlr.press/v144/zhang21a.html
%V 144
%X We study the infinite-horizon zero-sum linear quadratic (LQ) games, where the state transition is linear and the cost function is quadratic in states and actions of two players. In particular, we develop an adaptive algorithm that can properly trade off between exploration and exploitation of the unknown environment in LQ games based on the optimism-in-face-of-uncertainty (OFU) principle. We show that (i) the average regret of player $1$ (the min player) can be bounded by $\widetilde{\mathcal{O}}(1/\sqrt{T})$ against any fixed linear policy of the adversary (player $2$); (ii) the average cost of player $1$ also converges to the value of the game at a sublinear $\widetilde{\mathcal{O}}(1/\sqrt{T})$ rate if the adversary plays adaptively against player $1$ with the same algorithm, i.e., with self-play. To the best of our knowledge, this is the first time that a provably sample efficient reinforcement learning algorithm is proposed for zero-sum LQ games.
APA
Zhang, J., Yang, Z., Zhou, Z. & Wang, Z. (2021). Provably Sample Efficient Reinforcement Learning in Competitive Linear Quadratic Systems. Proceedings of the 3rd Conference on Learning for Dynamics and Control, in Proceedings of Machine Learning Research 144:597-598. Available from https://proceedings.mlr.press/v144/zhang21a.html.