# VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

*Proceedings of Thirty Sixth Conference on Learning Theory*, PMLR 195:987-1063, 2023.

#### Abstract

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning, and bound its regret assuming closure under Bellman backups and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\widetilde{O}(d\sqrt{TH}+d^6H^{5})$ regret over $T$ episodes for a horizon-$H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
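To give a concrete feel for what "weighted regression-based upper and lower bounds" can look like in the linear special case, here is a minimal sketch of variance-weighted ridge regression with an elliptical confidence bonus. It is a generic illustration and not the VO$Q$L algorithm itself: the function names, the regularizer `reg`, the confidence radius `beta`, and the assumption that per-sample variance estimates are already available are all hypothetical choices made for this example.

```python
import numpy as np


def weighted_ridge(features, targets, variances, reg=1.0):
    """Variance-weighted ridge regression in closed form (illustrative only).

    Each sample is weighted by 1 / variance, so low-noise samples count more.
    `reg` is an assumed ridge parameter, not a value taken from the paper.
    """
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    dim = features.shape[1]
    # Weighted Gram matrix: reg * I + sum_i phi_i phi_i^T / sigma_i^2
    gram = reg * np.eye(dim) + (features * weights[:, None]).T @ features
    # Weighted right-hand side: sum_i phi_i y_i / sigma_i^2
    rhs = features.T @ (weights * targets)
    theta = np.linalg.solve(gram, rhs)
    return theta, gram


def value_bounds(phi, theta, gram, beta):
    """Lower and upper estimates: prediction -/+ beta * ||phi||_{gram^{-1}}.

    `beta` is a hypothetical confidence radius chosen by the caller.
    """
    bonus = beta * np.sqrt(phi @ np.linalg.solve(gram, phi))
    estimate = phi @ theta
    return estimate - bonus, estimate + bonus
```

For example, calling `weighted_ridge` on a batch of feature vectors and regression targets returns a parameter estimate and the weighted Gram matrix; passing both to `value_bounds` with a query feature vector yields a pessimistic and an optimistic value estimate around the prediction. The bonus shrinks in directions where many low-variance samples have been observed.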