PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4065-4074, 2018.

Abstract

The exploding gradient problem has previously been explained as central to deep learning and model-based reinforcement learning because it causes numerical issues and instability in optimization. Our experiments in model-based reinforcement learning suggest that the problem is not just a numerical issue, but may stem from a fundamental chaos-like nature of long chains of nonlinear computations: not only do the magnitudes of the gradients become large, but their direction also becomes essentially random. We show that reparameterization gradients suffer from this problem, while likelihood ratio gradients are robust to it. Using these insights, we develop a model-based policy search framework, Probabilistic Inference for Particle-Based Policy Search (PIPPS), which is easily extensible and allows almost arbitrary models and policies, while matching the performance of previous data-efficient learning algorithms. Finally, we introduce the total propagation algorithm, which efficiently computes a union over all pathwise derivative depths during a single backward pass, automatically giving greater weight to estimators with lower variance, sometimes improving over reparameterization gradients by a factor of $10^6$.
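
To make the contrast between the two estimators concrete, here is a minimal sketch (not the authors' implementation; the test function f, theta, and sigma are illustrative choices) of the reparameterization and likelihood-ratio estimators for the gradient of E_{x ~ N(theta, sigma^2)}[f(x)], together with an inverse-variance combination in the spirit of total propagation:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A wiggly objective: its sharp local slopes leak into the pathwise
    # gradient, mimicking the chaos-like gradient behaviour in the paper.
    return 0.1 * np.sin(50.0 * x) + x ** 2

def df(x):
    # Analytic derivative of f, used by the pathwise estimator.
    return 5.0 * np.cos(50.0 * x) + 2.0 * x

theta, sigma, n = 0.3, 0.5, 1000
eps = rng.standard_normal(n)
x = theta + sigma * eps                        # reparameterized samples

# Reparameterization (pathwise) estimator: differentiate through the samples.
grad_rp = df(x)

# Likelihood-ratio (score-function) estimator with a simple mean baseline;
# it only evaluates f, never its derivative, so it ignores the wiggles.
score = (x - theta) / sigma ** 2               # d/dtheta log N(x; theta, sigma^2)
grad_lr = (f(x) - f(x).mean()) * score

# Weight each estimator by its inverse sample variance -- the principle that
# total propagation applies at every derivative depth in one backward pass.
w_rp, w_lr = 1.0 / grad_rp.var(), 1.0 / grad_lr.var()
grad = (w_rp * grad_rp.mean() + w_lr * grad_lr.mean()) / (w_rp + w_lr)
print(grad_rp.mean(), grad_lr.mean(), grad)

With these illustrative numbers, the likelihood-ratio samples have much lower variance than the pathwise ones, so the inverse-variance weights shift the combined estimate toward the more reliable estimator, which is the behaviour the abstract describes.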

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-parmas18a,
  title     = {{PIPPS}: Flexible Model-Based Policy Search Robust to the Curse of Chaos},
  author    = {Parmas, Paavo and Rasmussen, Carl Edward and Peters, Jan and Doya, Kenji},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {4065--4074},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/parmas18a/parmas18a.pdf},
  url       = {https://proceedings.mlr.press/v80/parmas18a.html}
}
Endnote
%0 Conference Paper
%T PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos
%A Paavo Parmas
%A Carl Edward Rasmussen
%A Jan Peters
%A Kenji Doya
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-parmas18a
%I PMLR
%P 4065--4074
%U https://proceedings.mlr.press/v80/parmas18a.html
%V 80
APA
Parmas, P., Rasmussen, C.E., Peters, J. & Doya, K. (2018). PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:4065-4074. Available from https://proceedings.mlr.press/v80/parmas18a.html.
