ConQUR: Mitigating Delusional Bias in Deep Q-Learning

Dijia Su, Jayden Ooi, Tyler Lu, Dale Schuurmans, Craig Boutilier
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:9187-9195, 2020.

Abstract

Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-su20c,
  title     = {{C}on{QUR}: Mitigating Delusional Bias in Deep Q-Learning},
  author    = {Su, Dijia and Ooi, Jayden and Lu, Tyler and Schuurmans, Dale and Boutilier, Craig},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {9187--9195},
  year      = {2020},
  editor    = {Daumé III, Hal and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/su20c/su20c.pdf},
  url       = {https://proceedings.mlr.press/v119/su20c.html},
  abstract  = {Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.}
}
Endnote
%0 Conference Paper
%T ConQUR: Mitigating Delusional Bias in Deep Q-Learning
%A Dijia Su
%A Jayden Ooi
%A Tyler Lu
%A Dale Schuurmans
%A Craig Boutilier
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-su20c
%I PMLR
%P 9187--9195
%U https://proceedings.mlr.press/v119/su20c.html
%V 119
%X Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.
APA
Su, D., Ooi, J., Lu, T., Schuurmans, D. & Boutilier, C. (2020). ConQUR: Mitigating Delusional Bias in Deep Q-Learning. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:9187-9195. Available from https://proceedings.mlr.press/v119/su20c.html.

Related Material