VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:987-1063, 2023.
Abstract
We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning, and bound its regret assuming closure under Bellman backups and bounded Eluder dimension of the regression function class. As a special case, VO$Q$L achieves $\widetilde{O}(d\sqrt{TH}+d^6H^{5})$ regret over $T$ episodes for a horizon-$H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
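To make the central ingredient concrete, the sketch below (not from the paper; function names, the weighting scheme, and the bonus radius `beta` are illustrative assumptions) shows a variance-weighted ridge regression step for $d$-dimensional linear features, together with the kind of optimistic value estimate that a weighted-regression bonus would produce; the matching pessimistic lower bound flips the sign of the bonus.

```python
import numpy as np

def weighted_ridge_regression(features, targets, variances, lam=1.0):
    """Variance-weighted ridge regression (illustrative sketch).

    Each sample i is weighted by 1 / variances[i], so low-variance
    transitions dominate the fit; `lam` is the ridge regularizer.
    """
    d = features.shape[1]
    weights = 1.0 / variances                        # per-sample weights sigma_i^{-2}
    # Weighted Gram matrix  Lambda = sum_i w_i x_i x_i^T + lam * I
    gram = (features * weights[:, None]).T @ features + lam * np.eye(d)
    # Weighted moment vector  b = sum_i w_i x_i y_i
    moment = (features * weights[:, None]).T @ targets
    theta_hat = np.linalg.solve(gram, moment)        # regression solution
    return theta_hat, gram

def optimistic_value(x, theta_hat, gram, beta):
    """Upper confidence bound  <x, theta_hat> + beta * ||x||_{Lambda^{-1}}.

    `beta` is a confidence radius (assumed given here); the lower bound
    uses the same bonus with a negative sign.
    """
    bonus = beta * np.sqrt(x @ np.linalg.solve(gram, x))
    return x @ theta_hat + bonus
```

This is only a minimal sketch of the weighted-regression idea under a linear-feature assumption; the algorithm analyzed in the paper operates through a regression oracle over a general function class rather than an explicit closed-form solve.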