Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks

Litian Liang; Yaosheng Xu; Stephen Mcaleer; Dailin Hu; Alexander Ihler; Pieter Abbeel; Roy Fox

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:13285-13301, 2022.

Abstract

In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods, however none have shown success in sample-efficient learning through addressing estimation variance as the root cause of overestimation. In this paper, we propose MeanQ, a simple ensemble method that estimates target values as ensemble means. Despite its simplicity, MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark. Importantly, we find that an ensemble of size 5 sufficiently reduces estimation variance to obviate the lagging target network, eliminating it as a source of bias and further gaining sample efficiency. We justify intuitively and empirically the design choices in MeanQ, including the necessity of independent experience sampling. On a set of 26 benchmark Atari environments, MeanQ outperforms all tested baselines, including the best available baseline, SUNRISE, at 100K interaction steps in 16/26 environments, and by 68% on average. MeanQ also outperforms Rainbow DQN at 500K steps in 21/26 environments, and by 49% on average, and achieves average human-level performance using 200K ($\pm$100K) interaction steps. Our implementation is available at https://github.com/indylab/MeanQ.
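As a rough illustration of what "target values as ensemble means" could look like, the sketch below computes a one-step temporal-difference target by averaging the next-state Q-values of all ensemble members before taking the maximum over actions. This is one reading of the abstract, not the authors' implementation (which is at the repository linked above); the function name `ensemble_mean_target`, the array shape `(K, A)`, and the discount `gamma` are assumptions made for the example.

```python
import numpy as np

def ensemble_mean_target(next_q, reward, done, gamma=0.99):
    """One-step TD target using the ensemble mean (illustrative sketch).

    next_q: array of shape (K, A) holding the next-state Q-values of
            K ensemble members over A actions (hypothetical layout).
    """
    mean_q = next_q.mean(axis=0)   # average over ensemble members -> shape (A,)
    bootstrap = mean_q.max()       # greedy value under the averaged estimate
    return reward + gamma * (1.0 - float(done)) * bootstrap

# Toy example: two ensemble members, two actions.
next_q = np.array([[1.0, 2.0],
                   [3.0, 0.0]])
target = ensemble_mean_target(next_q, reward=1.0, done=False, gamma=0.5)

# For comparison: maximizing each member separately, then averaging,
# yields a value at least as large as maximizing the mean.
mean_of_max = next_q.max(axis=1).mean()
max_of_mean = next_q.mean(axis=0).max()
```

In this toy case `max_of_mean` is 2.0 while `mean_of_max` is 2.5, which hints at why averaging before the maximization can temper overestimation when individual estimates are noisy.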

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-liang22c,
  title = 	 {Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks},
  author =       {Liang, Litian and Xu, Yaosheng and Mcaleer, Stephen and Hu, Dailin and Ihler, Alexander and Abbeel, Pieter and Fox, Roy},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {13285--13301},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/liang22c/liang22c.pdf},
  url = 	 {https://proceedings.mlr.press/v162/liang22c.html},
  abstract = 	 {In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods, however none have shown success in sample-efficient learning through addressing estimation variance as the root cause of overestimation. In this paper, we propose MeanQ, a simple ensemble method that estimates target values as ensemble means. Despite its simplicity, MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark. Importantly, we find that an ensemble of size 5 sufficiently reduces estimation variance to obviate the lagging target network, eliminating it as a source of bias and further gaining sample efficiency. We justify intuitively and empirically the design choices in MeanQ, including the necessity of independent experience sampling. On a set of 26 benchmark Atari environments, MeanQ outperforms all tested baselines, including the best available baseline, SUNRISE, at 100K interaction steps in 16/26 environments, and by 68% on average. MeanQ also outperforms Rainbow DQN at 500K steps in 21/26 environments, and by 49% on average, and achieves average human-level performance using 200K ($\pm$100K) interaction steps. Our implementation is available at https://github.com/indylab/MeanQ.}
}

Endnote

%0 Conference Paper
%T Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks
%A Litian Liang
%A Yaosheng Xu
%A Stephen Mcaleer
%A Dailin Hu
%A Alexander Ihler
%A Pieter Abbeel
%A Roy Fox
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-liang22c
%I PMLR
%P 13285--13301
%U https://proceedings.mlr.press/v162/liang22c.html
%V 162
%X In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods, however none have shown success in sample-efficient learning through addressing estimation variance as the root cause of overestimation. In this paper, we propose MeanQ, a simple ensemble method that estimates target values as ensemble means. Despite its simplicity, MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark. Importantly, we find that an ensemble of size 5 sufficiently reduces estimation variance to obviate the lagging target network, eliminating it as a source of bias and further gaining sample efficiency. We justify intuitively and empirically the design choices in MeanQ, including the necessity of independent experience sampling. On a set of 26 benchmark Atari environments, MeanQ outperforms all tested baselines, including the best available baseline, SUNRISE, at 100K interaction steps in 16/26 environments, and by 68% on average. MeanQ also outperforms Rainbow DQN at 500K steps in 21/26 environments, and by 49% on average, and achieves average human-level performance using 200K ($\pm$100K) interaction steps. Our implementation is available at https://github.com/indylab/MeanQ.

APA


Liang, L., Xu, Y., Mcaleer, S., Hu, D., Ihler, A., Abbeel, P., & Fox, R. (2022). Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:13285-13301. Available from https://proceedings.mlr.press/v162/liang22c.html.
