VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

Alekh Agarwal, Yujia Jin, Tong Zhang
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:987-1063, 2023.

Abstract

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning, and bound its regret assuming closure under Bellman backups and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\widetilde{O}(d\sqrt{TH}+d^6H^{5})$ regret over $T$ episodes for a horizon-$H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted-regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
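
To make the weighted-regression idea concrete, the following minimal sketch illustrates the linear special case: a ridge regression in which each sample is reweighted by an inverse variance estimate, plus an elliptical-potential bonus that turns the point prediction into an optimistic upper bound (subtracting the bonus gives the corresponding lower bound). This is an illustration of the underlying technique, not the paper's VO$Q$L pseudocode; the variance estimates sigma2, the regularizer lam, and the confidence width beta are placeholder inputs here, whereas VO$Q$L estimates variances from data and maintains matched upper and lower value bounds.

import numpy as np

def weighted_ridge_regression(Phi, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression (illustrative sketch).
    Phi: (n, d) feature matrix; y: (n,) regression targets;
    sigma2: (n,) per-sample variance estimates (assumed given here)."""
    w = 1.0 / sigma2  # inverse-variance weights: low-noise samples count more
    d = Phi.shape[1]
    # Weighted Gram matrix: Lambda = sum_i phi_i phi_i^T / sigma2_i + lam * I
    Lambda = Phi.T @ (Phi * w[:, None]) + lam * np.eye(d)
    # Weighted least-squares solution: theta = Lambda^{-1} sum_i phi_i y_i / sigma2_i
    theta = np.linalg.solve(Lambda, Phi.T @ (y * w))
    return theta, Lambda

def optimistic_q(phi, theta, Lambda, beta):
    """Upper confidence estimate of Q at feature vector phi:
    point prediction plus the elliptical bonus beta * sqrt(phi^T Lambda^{-1} phi)."""
    bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return phi @ theta + bonus  # use (phi @ theta - bonus) for the lower bound

# Toy usage with unit variances and a hypothetical confidence width beta:
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))
y = Phi @ np.ones(5) + rng.normal(size=100)
theta, Lam = weighted_ridge_regression(Phi, y, sigma2=np.ones(100))
q_upper = optimistic_q(Phi[0], theta, Lam, beta=1.0)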

Cite this Paper


BibTeX
@InProceedings{pmlr-v195-agarwal23a,
  title     = {VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation},
  author    = {Agarwal, Alekh and Jin, Yujia and Zhang, Tong},
  booktitle = {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages     = {987--1063},
  year      = {2023},
  editor    = {Neu, Gergely and Rosasco, Lorenzo},
  volume    = {195},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--15 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v195/agarwal23a/agarwal23a.pdf},
  url       = {https://proceedings.mlr.press/v195/agarwal23a.html}
}
Endnote
%0 Conference Paper
%T VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation
%A Alekh Agarwal
%A Yujia Jin
%A Tong Zhang
%B Proceedings of Thirty Sixth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Gergely Neu
%E Lorenzo Rosasco
%F pmlr-v195-agarwal23a
%I PMLR
%P 987--1063
%U https://proceedings.mlr.press/v195/agarwal23a.html
%V 195
APA
Agarwal, A., Jin, Y. & Zhang, T. (2023). VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:987-1063. Available from https://proceedings.mlr.press/v195/agarwal23a.html.