On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs

Yuanzhou Chen; Jiafan He; Quanquan Gu

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3149-3183, 2022.

Abstract

We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. The existing UCLK algorithm \citep{zhou2020provably} for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend the uniform-PAC sample complexity from the episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an $\tilde{O}\big(d^2/((1-\gamma)^4\epsilon^2)+1/((1-\gamma)^6\epsilon^2)\big)$ uniform-PAC sample complexity, where $d$ is the dimension of the feature mapping, $\gamma \in(0,1)$ is the discount factor of the MDP, and $\epsilon$ is the accuracy parameter. To the best of our knowledge, this is the first $\tilde{O}(1/\epsilon^2)$ sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to the generative model).
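
To make the two terms of the stated bound concrete, the following worked comparison may help. It is only a sketch: it ignores the constants and polylogarithmic factors hidden by $\tilde{O}$, and the values $d = 20$ and $\gamma = 0.9$ are illustrative assumptions, not taken from the paper.

```latex
% Ratio of the two terms in the bound; the common factor 1/\epsilon^2 cancels.
\[
\frac{d^2/\big((1-\gamma)^4\epsilon^2\big)}{1/\big((1-\gamma)^6\epsilon^2\big)}
  = d^2(1-\gamma)^2 ,
\]
% so the d-dependent term is at least as large as the other
% whenever d >= 1/(1-\gamma).
% Illustrative values (assumed, not from the paper): d = 20, \gamma = 0.9.
\[
\frac{d^2}{(1-\gamma)^4} = \frac{400}{10^{-4}} = 4\times 10^{6},
\qquad
\frac{1}{(1-\gamma)^6} = \frac{1}{10^{-6}} = 10^{6}.
\]
```

Under these assumed values, $d = 20 \ge 1/(1-\gamma) = 10$, so the first term is the larger one; for $d < 10$ at the same $\gamma$, the $1/((1-\gamma)^6\epsilon^2)$ term would be the larger one instead.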

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-chen22f,
  title = 	 {On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel {MDP}s},
  author =       {Chen, Yuanzhou and He, Jiafan and Gu, Quanquan},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {3149--3183},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/chen22f/chen22f.pdf},
  url = 	 {https://proceedings.mlr.press/v162/chen22f.html},
  abstract = 	 {We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. Existing UCLK \citep{zhou2020provably} algorithm for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend the uniform-PAC sample complexity from episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an $\Tilde{O}\big(d^2/((1-\gamma)^4\epsilon^2)+1/((1-\gamma)^6\epsilon^2)\big)$ uniform-PAC sample complexity, where $d$ is the dimension of the feature mapping, $\gamma \in(0,1)$ is the discount factor of the MDP and $\epsilon$ is the accuracy parameter. To the best of our knowledge, this is the first $\tilde{O}(1/\epsilon^2)$ sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to the generative model).}
}

Endnote

%0 Conference Paper
%T On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs
%A Yuanzhou Chen
%A Jiafan He
%A Quanquan Gu
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-chen22f
%I PMLR
%P 3149--3183
%U https://proceedings.mlr.press/v162/chen22f.html
%V 162
%X We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. Existing UCLK \citep{zhou2020provably} algorithm for this setting only has a regret guarantee, which cannot lead to a tight sample complexity bound. In this paper, we extend the uniform-PAC sample complexity from episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an $\Tilde{O}\big(d^2/((1-\gamma)^4\epsilon^2)+1/((1-\gamma)^6\epsilon^2)\big)$ uniform-PAC sample complexity, where $d$ is the dimension of the feature mapping, $\gamma \in(0,1)$ is the discount factor of the MDP and $\epsilon$ is the accuracy parameter. To the best of our knowledge, this is the first $\tilde{O}(1/\epsilon^2)$ sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation (without access to the generative model).

APA


Chen, Y., He, J. & Gu, Q. (2022). On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:3149-3183. Available from https://proceedings.mlr.press/v162/chen22f.html.
