Emphatic Algorithms for Deep Reinforcement Learning

Ray Jiang, Tom Zahavy, Zhongwen Xu, Adam White, Matteo Hessel, Charles Blundell, Hado Van Hasselt
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5023-5033, 2021.

Abstract

Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling; this is known as the “deadly triad”. The emphatic temporal difference (ETD(λ)) algorithm ensures convergence in the linear case by appropriately weighting the TD(λ) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD(λ) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance. We then derive new emphatic algorithms for use in the context of such algorithms, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we observe improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
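To make the phrase "appropriately weighting the TD(λ) updates" concrete, the following is a minimal sketch of one step of the standard linear ETD(λ) update as it is commonly stated in the literature. It is not the new deep-RL algorithms derived in the paper. It assumes constant discount, constant λ, and constant interest; all function and variable names (`etd_lambda_step`, `followon`, `emphasis`, and so on) are illustrative choices, not taken from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def etd_lambda_step(theta, trace, followon, rho_prev, x, x_next, reward, rho,
                    alpha=0.1, gamma=0.9, lam=0.0, interest=1.0):
    """One linear ETD(lambda) step (sketch; constant gamma, lambda, interest).

    theta    : current weight vector
    trace    : eligibility trace from the previous step (zeros at the start)
    followon : followon trace from the previous step (0.0 at the start)
    rho_prev : importance-sampling ratio of the previous step
    rho      : importance-sampling ratio of the current step
    x, x_next: feature vectors of the current and next state
    """
    # Followon trace: importance-weighted, discounted accumulation of interest.
    followon = rho_prev * gamma * followon + interest
    # Emphasis interpolates between the interest and the followon trace.
    emphasis = lam * interest + (1.0 - lam) * followon
    # The eligibility trace is the TD(lambda) trace with the new features
    # scaled by the emphasis; this is the "weighting" of the update.
    trace = [rho * (gamma * lam * e_i + emphasis * x_i)
             for e_i, x_i in zip(trace, x)]
    # Ordinary one-step TD error under linear value estimates.
    td_error = reward + gamma * dot(theta, x_next) - dot(theta, x)
    theta = [th_i + alpha * td_error * e_i for th_i, e_i in zip(theta, trace)]
    return theta, trace, followon, emphasis


# Hypothetical two-feature transition, purely for illustration.
theta, trace, followon, emphasis = etd_lambda_step(
    theta=[0.0, 0.0], trace=[0.0, 0.0], followon=0.0, rho_prev=1.0,
    x=[1.0, 0.0], x_next=[0.0, 1.0], reward=1.0, rho=2.0)
```

With `followon` initialised to 0.0, the first call sets it to the interest. On later steps it grows with the product of past importance ratios, which is the source of the emphasis that reweights each update.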

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-jiang21j,
  title     = {Emphatic Algorithms for Deep Reinforcement Learning},
  author    = {Jiang, Ray and Zahavy, Tom and Xu, Zhongwen and White, Adam and Hessel, Matteo and Blundell, Charles and Van Hasselt, Hado},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5023--5033},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/jiang21j/jiang21j.pdf},
  url       = {https://proceedings.mlr.press/v139/jiang21j.html},
  abstract  = {Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling; this is known as the ``deadly triad''. The emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance. We then derive new emphatic algorithms for use in the context of such algorithms, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we observe improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.}
}
Endnote
%0 Conference Paper
%T Emphatic Algorithms for Deep Reinforcement Learning
%A Ray Jiang
%A Tom Zahavy
%A Zhongwen Xu
%A Adam White
%A Matteo Hessel
%A Charles Blundell
%A Hado Van Hasselt
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-jiang21j
%I PMLR
%P 5023--5033
%U https://proceedings.mlr.press/v139/jiang21j.html
%V 139
%X Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling; this is known as the “deadly triad”. The emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance. We then derive new emphatic algorithms for use in the context of such algorithms, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we observe improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
APA
Jiang, R., Zahavy, T., Xu, Z., White, A., Hessel, M., Blundell, C., & Van Hasselt, H. (2021). Emphatic Algorithms for Deep Reinforcement Learning. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5023-5033. Available from https://proceedings.mlr.press/v139/jiang21j.html.