[edit]
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:3743-3774, 2024.
Abstract
We consider the problem of learning an ε-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Given access to a generative model, we achieve rate-optimal sample complexity by performing a simple, \emph{perturbed} version of least-squares value iteration with orthogonal trigonometric polynomials as features. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our ˜O(ϵ−2−d/(ν+1)) sample complexity, where d is the dimension of the state-action space and ν the order of smoothness, recovers the state-of-the-art result of discretization approaches for the special case of Lipschitz MDPs (ν=0). At the same time, for ν→∞, it recovers and greatly generalizes the O(ϵ−2) rate of low-rank MDPs, which are more amenable to regression approaches. In this sense, our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.