Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Dongruo Zhou; Jiafan He; Quanquan Gu

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Dongruo Zhou, Jiafan He, Quanquan Gu

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12793-12802, 2021.

Abstract

Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of certain feature mapping. We propose a novel algorithm which makes use of the feature mapping and obtains a

$\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where

$d$ is the dimension of the feature space,

$T$ is the time horizon and

$\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by

$\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ . Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a

$(1-\gamma)^{-0.5}$ factor.

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-zhou21a,
  title = 	 {Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping},
  author =       {Zhou, Dongruo and He, Jiafan and Gu, Quanquan},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {12793--12802},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/zhou21a/zhou21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/zhou21a.html},
  abstract = 	 {Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of certain feature mapping. We propose a novel algorithm which makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.}
}

Endnote

%0 Conference Paper
%T Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
%A Dongruo Zhou
%A Jiafan He
%A Quanquan Gu
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang	
%F pmlr-v139-zhou21a
%I PMLR
%P 12793--12802
%U https://proceedings.mlr.press/v139/zhou21a.html
%V 139
%X Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of certain feature mapping. We propose a novel algorithm which makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.

APA


Zhou, D., He, J. & Gu, Q.. (2021). Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12793-12802 Available from https://proceedings.mlr.press/v139/zhou21a.html.

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Abstract

Cite this Paper

Related Material