Revisiting Peng’s Q($\lambda$) for Modern Reinforcement Learning

Tadashi Kozuno, Yunhao Tang, Mark Rowland, Remi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5794-5804, 2021.

Abstract

Off-policy multi-step reinforcement learning algorithms fall into two classes, conservative and non-conservative: the former actively cut traces, whereas the latter do not. Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have limited or no theoretical guarantees. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by these empirical results and the lack of theory, we carry out a theoretical analysis of Peng’s Q($\lambda$), a representative example of non-conservative algorithms. We prove that \emph{it also converges to an optimal policy}, provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has long been conjectured but never proven. We also experiment with Peng’s Q($\lambda$) in complex continuous control tasks, confirming that it often outperforms conservative algorithms despite its simplicity. These results indicate that Peng’s Q($\lambda$), long thought to be unsafe, is a theoretically sound and practically effective algorithm.
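For context, the sketch below illustrates the standard Peng’s Q($\lambda$) target that the abstract refers to: a $\lambda$-weighted mixture of $n$-step returns, each bootstrapped with the greedy value $\max_a Q(s_{t+n}, a)$, computed without trace cutting or importance weighting (which is what makes the algorithm non-conservative). Using the backward recursion $G_t = r_t + \gamma\,[(1-\lambda)\max_a Q(s_{t+1}, a) + \lambda G_{t+1}]$, the targets for a whole trajectory can be computed in a single backward pass. The function name and array layout here are illustrative only and are not taken from the paper or its code.

import numpy as np

def pengs_q_lambda_targets(rewards, greedy_values, gamma, lam):
    # rewards[t]       = r_t                  for t = 0, ..., T-1
    # greedy_values[t] = max_a Q(s_{t+1}, a)  for t = 0, ..., T-1
    # Returns targets G_0, ..., G_{T-1}: each is a lambda-mixture of n-step
    # returns bootstrapped with greedy values. No traces are cut and no
    # importance weights are applied (the "non-conservative" choice).
    T = len(rewards)
    targets = np.zeros(T)
    g = greedy_values[-1]  # G_T is taken to be max_a Q(s_T, a)
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * greedy_values[t] + lam * g)
        targets[t] = g
    return targets

# Toy usage with made-up numbers: a length-3 trajectory segment.
rewards = np.array([1.0, 0.0, 1.0])
greedy_values = np.array([0.5, 0.2, 0.7])
print(pengs_q_lambda_targets(rewards, greedy_values, gamma=0.99, lam=0.9))

Setting lam=0 recovers the one-step Q-learning target, while lam=1 gives the uncorrected Monte-Carlo-style return bootstrapped only at the end of the segment.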

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-kozuno21a,
  title     = {Revisiting Peng’s Q($\lambda$) for Modern Reinforcement Learning},
  author    = {Kozuno, Tadashi and Tang, Yunhao and Rowland, Mark and Munos, Remi and Kapturowski, Steven and Dabney, Will and Valko, Michal and Abel, David},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5794--5804},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/kozuno21a/kozuno21a.pdf},
  url       = {https://proceedings.mlr.press/v139/kozuno21a.html},
  abstract  = {Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng’s Q($\lambda$), a representative example of non-conservative algorithms. We prove that \emph{it also converges to an optimal policy} provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng’s Q($\lambda$) in complex continuous control tasks, confirming that Peng’s Q($\lambda$) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng’s Q($\lambda$), which was thought to be unsafe, is a theoretically-sound and practically effective algorithm.}
}
Endnote
%0 Conference Paper
%T Revisiting Peng’s Q($λ$) for Modern Reinforcement Learning
%A Tadashi Kozuno
%A Yunhao Tang
%A Mark Rowland
%A Remi Munos
%A Steven Kapturowski
%A Will Dabney
%A Michal Valko
%A David Abel
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-kozuno21a
%I PMLR
%P 5794--5804
%U https://proceedings.mlr.press/v139/kozuno21a.html
%V 139
%X Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng’s Q($\lambda$), a representative example of non-conservative algorithms. We prove that \emph{it also converges to an optimal policy} provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng’s Q($\lambda$) in complex continuous control tasks, confirming that Peng’s Q($\lambda$) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng’s Q($\lambda$), which was thought to be unsafe, is a theoretically-sound and practically effective algorithm.
APA
Kozuno, T., Tang, Y., Rowland, M., Munos, R., Kapturowski, S., Dabney, W., Valko, M. & Abel, D. (2021). Revisiting Peng’s Q($λ$) for Modern Reinforcement Learning. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5794-5804. Available from https://proceedings.mlr.press/v139/kozuno21a.html.
