Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

Jiayi Huang, Han Zhong, Liwei Wang, Lin Yang
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3673-3681, 2024.

Abstract

To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, whose regret bound is both \emph{horizon-free} and \emph{instance-dependent}: it eliminates the polynomial dependency on the planning horizon. The derived regret bound is \emph{sharp}, as it matches the minimax lower bound, up to logarithmic factors, when specialized to linear mixture MDPs. Furthermore, UCRL-WVTR is \emph{computationally efficient} given access to a regression oracle. Achieving such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithmic designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analysis: a novel concentration bound for weighted non-linear least squares and a refined analysis that yields the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.
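
To make the core idea concrete, the sketch below is an illustrative (not the authors' exact) version of weighted value-targeted regression, specialized to the linear mixture MDP case where the weighted non-linear least squares problem has a closed form. The feature matrix, value targets, and per-sample variance estimates (phi, y, sigma2) are hypothetical inputs assumed to be produced by the learner; each sample is down-weighted by its estimated variance, which is what drives the horizon-free analysis.

import numpy as np

def weighted_vtr(phi: np.ndarray, y: np.ndarray, sigma2: np.ndarray,
                 lam: float = 1.0, sigma_min2: float = 1e-2) -> np.ndarray:
    """Weighted ridge regression: a linear-case analogue of the weighted
    non-linear least squares used for value-targeted regression.

    phi    : (n, d) features of the visited state-action pairs, integrated against V
    y      : (n,)   observed value targets V(s_{k+1})
    sigma2 : (n,)   per-sample variance estimates of the value targets
    """
    w = 1.0 / np.maximum(sigma2, sigma_min2)        # variance-aware weights
    A = lam * np.eye(phi.shape[1]) + (phi * w[:, None]).T @ phi
    b = (phi * w[:, None]).T @ y
    return np.linalg.solve(A, b)                    # estimated model parameter

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 5))
theta_true = rng.normal(size=5)
y = phi @ theta_true + rng.normal(scale=0.1, size=100)
sigma2 = np.full(100, 0.01)
theta_hat = weighted_vtr(phi, y, sigma2)

In the general function approximation setting, the closed-form solve above would be replaced by a call to a regression oracle over the given function class, with the same variance-based weighting of samples.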

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-huang24b, title = {Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation}, author = {Huang, Jiayi and Zhong, Han and Wang, Liwei and Yang, Lin}, booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics}, pages = {3673--3681}, year = {2024}, editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen}, volume = {238}, series = {Proceedings of Machine Learning Research}, month = {02--04 May}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v238/huang24b/huang24b.pdf}, url = {https://proceedings.mlr.press/v238/huang24b.html}, abstract = {To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed as UCRL-WVTR, that achieves both \emph{horizon-free} and \emph{instance-dependent}, since it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed \emph{sharp}, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analysis: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.} }
Endnote
%0 Conference Paper %T Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation %A Jiayi Huang %A Han Zhong %A Liwei Wang %A Lin Yang %B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2024 %E Sanjoy Dasgupta %E Stephan Mandt %E Yingzhen Li %F pmlr-v238-huang24b %I PMLR %P 3673--3681 %U https://proceedings.mlr.press/v238/huang24b.html %V 238 %X To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed as UCRL-WVTR, that achieves both \emph{horizon-free} and \emph{instance-dependent}, since it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed \emph{sharp}, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analysis: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.
APA
Huang, J., Zhong, H., Wang, L. & Yang, L. (2024). Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:3673-3681. Available from https://proceedings.mlr.press/v238/huang24b.html.