Learning Dynamics and Generalization in Deep Reinforcement Learning

Clare Lyle, Mark Rowland, Will Dabney, Marta Kwiatkowska, Yarin Gal
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:14560-14581, 2022.

Abstract

Solving a reinforcement learning (RL) problem poses two competing challenges: fitting a potentially discontinuous value function, and generalizing well to new observations. In this paper, we analyze the learning dynamics of temporal difference algorithms to gain novel insight into the tension between these two objectives. We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization. We corroborate these findings in deep RL agents trained on a range of environments, finding that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods. Finally, we investigate how post-training policy distillation may avoid this pitfall, and show that this approach improves generalization to novel environments in the ProcGen suite and improves robustness to input perturbations.
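To make the abstract's distinction between bootstrapped temporal difference targets and unbootstrapped targets concrete, here is a minimal, illustrative Python sketch of a semi-gradient TD(0) update next to a Monte Carlo update for a linear value function. The linear features, learning rate, and synthetic transition are assumptions made for illustration only; this is not the paper's experimental setup or its analysis.

import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's method):
# linear value function v(s) = w . phi(s).
rng = np.random.default_rng(0)
num_features, gamma, lr = 4, 0.99, 0.1
w = rng.normal(size=num_features)  # value-function weights

def v(phi, w):
    return phi @ w

def td0_update(w, phi_s, r, phi_next, done):
    # Bootstrapped target r + gamma * v(s'); the target is treated as a
    # constant, so the gradient flows only through v(s) ("semi-gradient").
    target = r + (0.0 if done else gamma * v(phi_next, w))
    td_error = target - v(phi_s, w)
    return w + lr * td_error * phi_s

def mc_update(w, phi_s, g_return):
    # Monte Carlo target: the observed discounted return G_t, no bootstrapping.
    return w + lr * (g_return - v(phi_s, w)) * phi_s

# One synthetic transition, just to show both updates run.
phi_s, phi_next = rng.normal(size=num_features), rng.normal(size=num_features)
w_td = td0_update(w, phi_s, r=1.0, phi_next=phi_next, done=False)
w_mc = mc_update(w, phi_s, g_return=1.0)
print("TD(0) weights:", w_td)
print("MC weights:   ", w_mc)

The only difference between the two updates is the regression target: the TD(0) target reuses the current value estimate at the successor state, which is the mechanism whose learning dynamics the paper analyzes.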

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-lyle22a,
  title     = {Learning Dynamics and Generalization in Deep Reinforcement Learning},
  author    = {Lyle, Clare and Rowland, Mark and Dabney, Will and Kwiatkowska, Marta and Gal, Yarin},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {14560--14581},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/lyle22a/lyle22a.pdf},
  url       = {https://proceedings.mlr.press/v162/lyle22a.html}
}
Endnote
%0 Conference Paper
%T Learning Dynamics and Generalization in Deep Reinforcement Learning
%A Clare Lyle
%A Mark Rowland
%A Will Dabney
%A Marta Kwiatkowska
%A Yarin Gal
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-lyle22a
%I PMLR
%P 14560--14581
%U https://proceedings.mlr.press/v162/lyle22a.html
%V 162
APA
Lyle, C., Rowland, M., Dabney, W., Kwiatkowska, M. & Gal, Y. (2022). Learning Dynamics and Generalization in Deep Reinforcement Learning. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:14560-14581. Available from https://proceedings.mlr.press/v162/lyle22a.html.