Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12653-12662, 2021.
Abstract
In this paper we consider the problem of learning an ϵ-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with S states, A actions, a discount factor γ∈(0,1), and an approximation threshold ϵ>0, we provide a model-free algorithm that learns an ϵ-optimal policy with sample complexity Õ(SA·ln(1/p)/(ϵ²(1−γ)^5.5)) and success probability 1−p, where the notation Õ(⋅) hides poly-logarithmic factors in S, A, 1/(1−γ), and 1/ϵ. For small enough ϵ, we give an improved algorithm with sample complexity Õ(SA·ln(1/p)/(ϵ²(1−γ)³)). While the first bound improves upon all known model-free algorithms and upon model-based ones with tight dependence on S, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
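For readability, the two bounds can be set side by side in display form; the lower-bound expression below is not written out in the abstract and is only the form implied by its tightness claim:

\[
\underbrace{\tilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)}_{\text{first algorithm}}
\quad\longrightarrow\quad
\underbrace{\tilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)}_{\text{improved algorithm, small }\epsilon}
\quad\text{vs.}\quad
\underbrace{\Omega\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)}_{\text{implied lower bound}}
\]

The two upper bounds differ only in the dependence on the effective horizon, by a factor of (1−γ)^{-2.5}.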