Nonlinear Distributional Gradient Temporal-Difference Learning

Chao Qu, Shie Mannor, Huan Xu
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5251-5260, 2019.

Abstract

We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning (RL) has recently been shown to outperform its non-distributional counterpart (Bellemare et al., 2017). In the policy evaluation setting, we design two new algorithms, distributional GTD2 and distributional TDC, by applying the Cramér distance to a distributional version of the Bellman error objective function; they inherit the advantages of both nonlinear gradient TD algorithms and the distributional RL approach. In the control setting, we propose distributional Greedy-GQ using a similar derivation. We prove asymptotic almost-sure convergence of distributional GTD2 and TDC to a locally optimal solution for general smooth function approximators, a class that includes the neural networks widely used in recent work on real-world RL problems. The per-step computational complexity of all three algorithms is linear in the number of parameters of the function approximator, so they can be implemented efficiently for neural networks.
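
As an illustration of the distance the abstract refers to, the sketch below computes a squared Cramér distance between two categorical distributions, taken here as the integral of the squared difference between their cumulative distribution functions. It is not the authors' implementation: the function name `cramer_distance_sq`, the shared sorted support, and the use of NumPy are assumptions made for this example only.

```python
import numpy as np


def cramer_distance_sq(p, q, support):
    """Squared Cramer distance between two categorical distributions.

    Assumes (for this sketch) that `p` and `q` are probability vectors over
    the same sorted 1-D `support`. The value is the integral of
    (F_p(x) - F_q(x))**2 over x, where F_p and F_q are the step-function CDFs.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = np.asarray(support, dtype=float)

    # Difference of the two CDFs evaluated at each atom of the support.
    cdf_gap = np.cumsum(p) - np.cumsum(q)

    # Each CDF is constant between consecutive atoms, so the integral is a
    # sum of squared gaps weighted by the spacing between atoms. The gap at
    # the last atom is zero because both vectors sum to one.
    widths = np.diff(support)
    return float(np.sum(cdf_gap[:-1] ** 2 * widths))
```

Calling `cramer_distance_sq([1, 0, 0], [0, 0, 1], [0.0, 1.0, 2.0])` compares two point masses at opposite ends of a three-atom support. The cost is linear in the number of atoms.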

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-qu19b,
  title = {Nonlinear Distributional Gradient Temporal-Difference Learning},
  author = {Qu, Chao and Mannor, Shie and Xu, Huan},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages = {5251--5260},
  year = {2019},
  editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = {97},
  series = {Proceedings of Machine Learning Research},
  month = {09--15 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v97/qu19b/qu19b.pdf},
  url = {https://proceedings.mlr.press/v97/qu19b.html},
  abstract = {We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning has been demonstrated to outperform the regular one in the recent study \citep{bellemare2017distributional}. In the policy evaluation setting, we design two new algorithms called distributional GTD2 and distributional TDC using the Cram{é}r distance on the distributional version of the Bellman error objective function, which inherits advantages of both the nonlinear gradient TD algorithms and the distributional RL approach. In the control setting, we propose the distributional Greedy-GQ using similar derivation. We prove the asymptotic almost-sure convergence of distributional GTD2 and TDC to a local optimal solution for general smooth function approximators, which includes neural networks that have been widely used in recent study to solve the real-life RL problems. In each step, the computational complexity of above three algorithms is linear w.r.t. the number of the parameters of the function approximator, thus can be implemented efficiently for neural networks.}
}
Endnote
%0 Conference Paper
%T Nonlinear Distributional Gradient Temporal-Difference Learning
%A Chao Qu
%A Shie Mannor
%A Huan Xu
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-qu19b
%I PMLR
%P 5251--5260
%U https://proceedings.mlr.press/v97/qu19b.html
%V 97
%X We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning has been demonstrated to outperform the regular one in the recent study \citep{bellemare2017distributional}. In the policy evaluation setting, we design two new algorithms called distributional GTD2 and distributional TDC using the Cram{é}r distance on the distributional version of the Bellman error objective function, which inherits advantages of both the nonlinear gradient TD algorithms and the distributional RL approach. In the control setting, we propose the distributional Greedy-GQ using similar derivation. We prove the asymptotic almost-sure convergence of distributional GTD2 and TDC to a local optimal solution for general smooth function approximators, which includes neural networks that have been widely used in recent study to solve the real-life RL problems. In each step, the computational complexity of above three algorithms is linear w.r.t. the number of the parameters of the function approximator, thus can be implemented efficiently for neural networks.
APA
Qu, C., Mannor, S. & Xu, H. (2019). Nonlinear Distributional Gradient Temporal-Difference Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:5251-5260. Available from https://proceedings.mlr.press/v97/qu19b.html.
