Ensemble Bootstrapping for Q-Learning

Oren Peer; Chen Tessler; Nadav Merlis; Ron Meir

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8454-8463, 2021.

Abstract

Q-learning (QL), a common reinforcement learning algorithm, suffers from over-estimation bias due to the maximization term in the optimal Bellman operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios, the under-estimation bias may degrade performance. In this work, we introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of Double-Q-learning to ensembles. We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. Finally, we demonstrate the superior performance of a deep RL variant of EBQL over other deep QL algorithms for a suite of ATARI games.
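
The abstract does not spell out the estimators it compares, so the following is only a minimal sketch under stated assumptions. It assumes that an "EBQL-like" estimate of the maximal mean splits each variable's samples into k disjoint folds, picks the arg-max variable using one fold, and evaluates that variable with the mean of the remaining folds; with k = 2 this coincides with the double estimator. The function names, the `select_fold` parameter, and the fold layout are hypothetical and not taken from the paper.

```python
from statistics import fmean


def single_estimator(samples):
    """Max of the per-variable sample means (the Q-learning-style estimate)."""
    return max(fmean(s) for s in samples)


def ensemble_estimator(samples, k, select_fold=0):
    """Assumed EBQL-like estimate of the maximal mean.

    samples[i] holds the draws of variable i. Each variable's draws are
    split into k disjoint folds. One fold selects the arg-max variable and
    the other k - 1 folds evaluate it, so selection and evaluation never
    share a sample. For k = 2 this is the double estimator.
    """
    if k < 2 or not 0 <= select_fold < k:
        raise ValueError("need k >= 2 and 0 <= select_fold < k")
    if any(len(s) < k for s in samples):
        raise ValueError("every variable needs at least k samples")

    # folds[i][j] is fold j of variable i (strided split keeps folds disjoint).
    folds = [[s[j::k] for j in range(k)] for s in samples]

    # Selection step: arg-max over variables using only the selection fold.
    best = max(range(len(samples)), key=lambda i: fmean(folds[i][select_fold]))

    # Evaluation step: mean of the chosen variable over the remaining folds.
    held_out = [x for j in range(k) if j != select_fold for x in folds[best][j]]
    return fmean(held_out)
```

Repeating both estimators over many independently drawn sample sets and comparing the average error against the true maximal mean is one way to probe the bias and MSE behavior the abstract refers to; no particular outcome is claimed here.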

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-peer21a,
  title = {Ensemble Bootstrapping for Q-Learning},
  author = {Peer, Oren and Tessler, Chen and Merlis, Nadav and Meir, Ron},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages = {8454--8463},
  year = {2021},
  editor = {Meila, Marina and Zhang, Tong},
  volume = {139},
  series = {Proceedings of Machine Learning Research},
  month = {18--24 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v139/peer21a/peer21a.pdf},
  url = {https://proceedings.mlr.press/v139/peer21a.html},
  abstract = {Q-learning (QL), a common reinforcement learning algorithm, suffers from over-estimation bias due to the maximization term in the optimal Bellman operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios, the under-estimation bias may degrade performance. In this work, we introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of Double-Q-learning to ensembles. We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. Finally, we demonstrate the superior performance of a deep RL variant of EBQL over other deep QL algorithms for a suite of ATARI games.}
}

Endnote

%0 Conference Paper
%T Ensemble Bootstrapping for Q-Learning
%A Oren Peer
%A Chen Tessler
%A Nadav Merlis
%A Ron Meir
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-peer21a
%I PMLR
%P 8454--8463
%U https://proceedings.mlr.press/v139/peer21a.html
%V 139
%X Q-learning (QL), a common reinforcement learning algorithm, suffers from over-estimation bias due to the maximization term in the optimal Bellman operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios, the under-estimation bias may degrade performance. In this work, we introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of Double-Q-learning to ensembles. We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. Finally, we demonstrate the superior performance of a deep RL variant of EBQL over other deep QL algorithms for a suite of ATARI games.

APA

Peer, O., Tessler, C., Merlis, N. & Meir, R. (2021). Ensemble Bootstrapping for Q-Learning. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:8454-8463. Available from https://proceedings.mlr.press/v139/peer21a.html.
