On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs

Yuanzhou Chen, Jiafan He, Quanquan Gu
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3149-3183, 2022.

Abstract

We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. The existing UCLK algorithm \citep{zhou2020provably} for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend the uniform-PAC sample complexity from the episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an $\tilde{O}\big(d^2/((1-\gamma)^4\epsilon^2)+1/((1-\gamma)^6\epsilon^2)\big)$ uniform-PAC sample complexity, where $d$ is the dimension of the feature mapping, $\gamma \in (0,1)$ is the discount factor of the MDP, and $\epsilon$ is the accuracy parameter. To the best of our knowledge, this is the first $\tilde{O}(1/\epsilon^2)$ sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to the generative model).
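
As a rough reading of the bound stated in the abstract, its two terms can be compared directly. The sketch below ignores the constants and logarithmic factors hidden by the tilde notation, so it indicates scaling only. The threshold $d \ge 1/(1-\gamma)$ is derived here from the stated bound by simple algebra and is not a claim taken from the paper.

```latex
% Requires amsmath for the cases environment.
% Ratio of the two terms in the stated bound (constants and log factors ignored).
\[
\frac{d^2/\big((1-\gamma)^4\epsilon^2\big)}{1/\big((1-\gamma)^6\epsilon^2\big)}
  = d^2(1-\gamma)^2 .
\]
% The first term is the larger one exactly when d(1-\gamma) >= 1.
\[
\frac{d^2}{(1-\gamma)^4\epsilon^2}+\frac{1}{(1-\gamma)^6\epsilon^2}
\;\le\;
\begin{cases}
2\,d^2/\big((1-\gamma)^4\epsilon^2\big), & d \ge 1/(1-\gamma),\\[2pt]
2/\big((1-\gamma)^6\epsilon^2\big), & d < 1/(1-\gamma).
\end{cases}
\]
```

For instance, with $\gamma = 0.99$ the threshold is $1/(1-\gamma) = 100$, so under this reading the $1/((1-\gamma)^6\epsilon^2)$ term governs the bound for feature dimensions below 100.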

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-chen22f,
  title = {On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel {MDP}s},
  author = {Chen, Yuanzhou and He, Jiafan and Gu, Quanquan},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages = {3149--3183},
  year = {2022},
  editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = {162},
  series = {Proceedings of Machine Learning Research},
  month = {17--23 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v162/chen22f/chen22f.pdf},
  url = {https://proceedings.mlr.press/v162/chen22f.html},
  abstract = {We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. Existing UCLK \citep{zhou2020provably} algorithm for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend the uniform-PAC sample complexity from episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an $\Tilde{O}\big(d^2/((1-\gamma)^4\epsilon^2)+1/((1-\gamma)^6\epsilon^2)\big)$ uniform-PAC sample complexity, where $d$ is the dimension of the feature mapping, $\gamma \in(0,1)$ is the discount factor of the MDP and $\epsilon$ is the accuracy parameter. To the best of our knowledge, this is the first $\tilde{O}(1/\epsilon^2)$ sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to the generative model).}
}
Endnote
%0 Conference Paper
%T On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs
%A Yuanzhou Chen
%A Jiafan He
%A Quanquan Gu
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-chen22f
%I PMLR
%P 3149--3183
%U https://proceedings.mlr.press/v162/chen22f.html
%V 162
%X We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. Existing UCLK \citep{zhou2020provably} algorithm for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend the uniform-PAC sample complexity from episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an $\Tilde{O}\big(d^2/((1-\gamma)^4\epsilon^2)+1/((1-\gamma)^6\epsilon^2)\big)$ uniform-PAC sample complexity, where $d$ is the dimension of the feature mapping, $\gamma \in(0,1)$ is the discount factor of the MDP and $\epsilon$ is the accuracy parameter. To the best of our knowledge, this is the first $\tilde{O}(1/\epsilon^2)$ sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to the generative model).
APA
Chen, Y., He, J., & Gu, Q. (2022). On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:3149-3183. Available from https://proceedings.mlr.press/v162/chen22f.html.
