Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Zihan Zhang, Yuan Zhou, Xiangyang Ji
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12653-12662, 2021.

Abstract

In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\tilde{O}\!\left(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}}\right)$ and success probability $(1-p)$, where the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S$, $A$, $1/(1-\gamma)$, and $1/\epsilon$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}\!\left(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}}\right)$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.
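To make the final claim concrete, below is a minimal LaTeX sketch comparing the second bound with the information-theoretic lower bound; the exact form of the lower bound written here (including the $\ln(1/p)$ factor) is an assumption of this sketch, not a statement quoted from the abstract.

% Sketch: how the second bound matches the lower bound up to log factors.
% The displayed lower bound is assumed, not quoted from the paper.
\begin{align*}
  \text{upper bound (second algorithm)} &= \tilde{O}\!\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right), \\
  \text{information-theoretic lower bound} &= \Omega\!\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right), \\
  \frac{\text{upper bound}}{\text{lower bound}} &= \mathrm{polylog}\!\left(S,\, A,\, \tfrac{1}{1-\gamma},\, \tfrac{1}{\epsilon}\right).
\end{align*}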

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-zhang21ab,
  title     = {Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity},
  author    = {Zhang, Zihan and Zhou, Yuan and Ji, Xiangyang},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {12653--12662},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/zhang21ab/zhang21ab.pdf},
  url       = {https://proceedings.mlr.press/v139/zhang21ab.html},
  abstract  = {In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}})$ \footnote{In this work, the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S,A,1/(1-\gamma)$, and $1/\epsilon$.} and success probability $(1-p)$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}})$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.}
}
Endnote
%0 Conference Paper
%T Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
%A Zihan Zhang
%A Yuan Zhou
%A Xiangyang Ji
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-zhang21ab
%I PMLR
%P 12653--12662
%U https://proceedings.mlr.press/v139/zhang21ab.html
%V 139
%X In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}})$ \footnote{In this work, the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S,A,1/(1-\gamma)$, and $1/\epsilon$.} and success probability $(1-p)$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}})$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.
APA
Zhang, Z., Zhou, Y. & Ji, X. (2021). Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12653-12662. Available from https://proceedings.mlr.press/v139/zhang21ab.html.