Frequentist Regret Bounds for Randomized Least-Squares Value Iteration
Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, PMLR 108:1954-1964, 2020.
Abstract
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are infeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where exploration is induced by perturbing the least-squares approximation of the action-value function. Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps. To the best of our knowledge, this is the first frequentist regret analysis for randomized exploration with function approximation.
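The core mechanism the abstract describes, exploration via perturbation of a least-squares fit of the action-value function, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact algorithm: it performs one ridge-regression backup on feature/target pairs and then samples the weight vector from a Gaussian centred at the least-squares solution, with covariance shaped by the inverse regularized Gram matrix. The function name, noise scale, and regularizer are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_lsvi_step(phi, targets, noise_std=1.0, reg=1.0):
    """One RLSVI-style backup (illustrative sketch).

    phi:     (n, d) feature matrix for observed state-action pairs
    targets: (n,)   regression targets (e.g. reward + next-step value)
    Returns the ridge solution theta_hat and a randomly perturbed
    copy theta_tilde used to drive exploration.
    """
    d = phi.shape[1]
    gram = phi.T @ phi + reg * np.eye(d)          # regularized Gram matrix
    theta_hat = np.linalg.solve(gram, phi.T @ targets)  # ridge solution
    # Exploration: sample weights ~ N(theta_hat, noise_std^2 * gram^{-1}),
    # so uncertainty is larger in directions with little observed data.
    cov = noise_std**2 * np.linalg.inv(gram)
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    return theta_hat, theta_tilde
```

Acting greedily with respect to the perturbed weights `theta_tilde` (rather than `theta_hat`) is what induces exploration in this family of algorithms; directions of the feature space that have been visited often receive small perturbations, while poorly explored directions receive large ones.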