Q-learning with Logarithmic Regret

Kunhe Yang; Lin Yang; Simon Du

Q-learning with Logarithmic Regret

Kunhe Yang, Lin Yang, Simon Du

Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:1576-1584, 2021.

Abstract

This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.

Cite this Paper

BibTeX


@InProceedings{pmlr-v130-yang21b,
  title = 	 { Q-learning with Logarithmic Regret },
  author =       {Yang, Kunhe and Yang, Lin and Du, Simon},
  booktitle = 	 {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {1576--1584},
  year = 	 {2021},
  editor = 	 {Banerjee, Arindam and Fukumizu, Kenji},
  volume = 	 {130},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--15 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v130/yang21b/yang21b.pdf},
  url = 	 {https://proceedings.mlr.press/v130/yang21b.html},
  abstract = 	 { This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound. }
}

Endnote

%0 Conference Paper
%T  Q-learning with Logarithmic Regret 
%A Kunhe Yang
%A Lin Yang
%A Simon Du
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu	
%F pmlr-v130-yang21b
%I PMLR
%P 1576--1584
%U https://proceedings.mlr.press/v130/yang21b.html
%V 130
%X  This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.

APA


Yang, K., Yang, L. & Du, S.. (2021).  Q-learning with Logarithmic Regret . Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:1576-1584 Available from https://proceedings.mlr.press/v130/yang21b.html.

Q-learning with Logarithmic Regret

Abstract

Cite this Paper

Related Material