On the Sample Complexity of Learning Infinite-horizon Discounted Linear Kernel MDPs
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3149-3183, 2022.
Abstract
We study reinforcement learning for infinite-horizon discounted linear kernel MDPs, where the transition probability function is linear in a predefined feature mapping. The existing UCLK algorithm \citep{zhou2020provably} for this setting has only a regret guarantee, which cannot be converted into a tight sample complexity bound. In this paper, we extend the uniform-PAC sample complexity framework from the episodic setting to the infinite-horizon discounted setting, and propose a novel algorithm dubbed UPAC-UCLK that achieves an Õ(d^2/((1−γ)^4 ϵ^2) + 1/((1−γ)^6 ϵ^2)) uniform-PAC sample complexity, where d is the dimension of the feature mapping, γ ∈ (0,1) is the discount factor of the MDP, and ϵ is the accuracy parameter. To the best of our knowledge, this is the first Õ(1/ϵ^2) sample complexity bound for learning infinite-horizon discounted MDPs with linear function approximation, without access to a generative model.