UCB Momentum Q-learning: Correcting the bias without forgetting

Pierre Menard; Omar Darwiche Domingues; Xuedong Shang; Michal Valko

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:7609-7618, 2021.

Abstract

We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning, to which we add a momentum term, and relies on the principle of optimism in the face of uncertainty to deal with exploration. The new technical ingredient of UCBMQ is the use of momentum to correct the bias that Q-learning suffers from while, \emph{at the same time}, limiting the impact it has on the second-order term of the regret. For UCBMQ, we are able to guarantee a regret of at most $\tilde{O}(\sqrt{H^3SAT}+ H^4 S A)$, where $H$ is the length of an episode, $S$ the number of states, $A$ the number of actions, and $T$ the number of episodes, ignoring terms in poly$\log(SAHT)$. Notably, UCBMQ is the first algorithm that simultaneously matches the lower bound of $\Omega(\sqrt{H^3SAT})$ for large enough $T$ and has a second-order term (with respect to $T$) that scales \emph{only linearly} with the number of states $S$.
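
As a reading aid (not part of the original abstract), the phrase "for large enough $T$" can be made concrete by comparing the two terms of the bound. Ignoring constants and logarithmic factors, which is an assumption of this sketch, the first-order term dominates the second-order term once
\[
\sqrt{H^3SAT} \ge H^4SA
\iff H^3SAT \ge H^8S^2A^2
\iff T \ge H^5SA .
\]
Under these simplifications, the regret is of order $\sqrt{H^3SAT}$, matching the lower bound, as soon as the number of episodes $T$ exceeds roughly $H^5SA$.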

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-menard21b,
  title = 	 {UCB Momentum Q-learning: Correcting the bias without forgetting},
  author =       {Menard, Pierre and Domingues, Omar Darwiche and Shang, Xuedong and Valko, Michal},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {7609--7618},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/menard21b/menard21b.pdf},
  url = 	 {https://proceedings.mlr.press/v139/menard21b.html},
  abstract = 	 {We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning, to which we add a momentum term, and relies on the principle of optimism in the face of uncertainty to deal with exploration. The new technical ingredient of UCBMQ is the use of momentum to correct the bias that Q-learning suffers from while, \emph{at the same time}, limiting the impact it has on the second-order term of the regret. For UCBMQ, we are able to guarantee a regret of at most $\tilde{O}(\sqrt{H^3SAT}+ H^4 S A)$, where $H$ is the length of an episode, $S$ the number of states, $A$ the number of actions, and $T$ the number of episodes, ignoring terms in poly$\log(SAHT)$. Notably, UCBMQ is the first algorithm that simultaneously matches the lower bound of $\Omega(\sqrt{H^3SAT})$ for large enough $T$ and has a second-order term (with respect to $T$) that scales \emph{only linearly} with the number of states $S$.}
}

Endnote

%0 Conference Paper
%T UCB Momentum Q-learning: Correcting the bias without forgetting
%A Pierre Menard
%A Omar Darwiche Domingues
%A Xuedong Shang
%A Michal Valko
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-menard21b
%I PMLR
%P 7609--7618
%U https://proceedings.mlr.press/v139/menard21b.html
%V 139
%X We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning, to which we add a momentum term, and relies on the principle of optimism in the face of uncertainty to deal with exploration. The new technical ingredient of UCBMQ is the use of momentum to correct the bias that Q-learning suffers from while, \emph{at the same time}, limiting the impact it has on the second-order term of the regret. For UCBMQ, we are able to guarantee a regret of at most $\tilde{O}(\sqrt{H^3SAT}+ H^4 S A)$, where $H$ is the length of an episode, $S$ the number of states, $A$ the number of actions, and $T$ the number of episodes, ignoring terms in poly$\log(SAHT)$. Notably, UCBMQ is the first algorithm that simultaneously matches the lower bound of $\Omega(\sqrt{H^3SAT})$ for large enough $T$ and has a second-order term (with respect to $T$) that scales \emph{only linearly} with the number of states $S$.

APA

Menard, P., Domingues, O.D., Shang, X. & Valko, M. (2021). UCB Momentum Q-learning: Correcting the bias without forgetting. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:7609-7618. Available from https://proceedings.mlr.press/v139/menard21b.html.
