[edit]
Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12793-12802, 2021.
Abstract
Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of certain feature mapping. We propose a novel algorithm which makes use of the feature mapping and obtains a ˜O(d√T/(1−γ)2) regret, where d is the dimension of the feature space, T is the time horizon and γ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by Ω(d√T/(1−γ)1.5). Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a (1−γ)−0.5 factor.