Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4171-4180, 2021.
Abstract
Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining √T-type regret bounds, where T is the number of interactions with the MDP. In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions, provided that there exists a positive sub-optimality gap for the optimal action-value function. More specifically, under the linear MDP assumption (Jin et al., 2020), the LSVI-UCB algorithm can achieve Õ(d^3 H^5 / gap_min · log(T)) regret; and under the linear mixture MDP assumption (Ayoub et al., 2020), the UCRL-VTR algorithm can achieve Õ(d^2 H^5 / gap_min · log^3(T)) regret, where d is the dimension of the feature mapping, H is the length of an episode, gap_min is the minimal sub-optimality gap, and Õ hides all logarithmic terms except log(T). To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models.
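For context, the quantity gap_min appearing in both bounds is the minimal sub-optimality gap of the optimal action-value function. A sketch of the standard episodic definition is given below; the notation is an assumption and may differ slightly from the paper.

% Sub-optimality gap in an episodic MDP with horizon H (standard definition, assumed notation)
\[
  \mathrm{gap}_h(s,a) \;=\; V_h^{*}(s) - Q_h^{*}(s,a),
  \qquad
  \mathrm{gap}_{\min} \;=\; \min_{h,s,a}\bigl\{\mathrm{gap}_h(s,a) : \mathrm{gap}_h(s,a) > 0\bigr\},
\]
so the regret bounds above scale inversely with the smallest nonzero gap between the optimal value and the value of any sub-optimal action.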