Reparameterized Policy Learning for Multimodal Trajectory Optimization

Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:13957-13975, 2023.

Abstract

We investigate the challenge of parameterizing policies for reinforcement learning (RL) in high-dimensional continuous action spaces. Our objective is to develop a multimodal policy that overcomes limitations inherent in the commonly used Gaussian parameterization. To achieve this, we propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. By conditioning the policy on a latent variable, we derive a novel variational bound as the optimization objective, which promotes exploration of the environment. We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and a learned world model to achieve strong exploration capabilities and high data efficiency. Empirical results demonstrate that our method can help agents evade local optima in tasks with dense rewards and solve challenging sparse-reward environments by incorporating an object-centric intrinsic reward. Our method consistently outperforms previous approaches across a range of tasks. Code and supplementary materials are available on the project page: https://haosulab.github.io/RPG/.
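
To make the core idea concrete, below is a minimal sketch (not the authors' released implementation) of a latent-conditioned policy pi(a | s, z) as described in the abstract: a latent z is drawn from a prior once per trajectory, and a Gaussian action head is conditioned on it, so the marginal policy over actions is a mixture over latents and hence multimodal. All module names, hyperparameters, and dimensions here are illustrative assumptions.

    # Illustrative sketch of a latent-conditioned policy pi(a | s, z).
    # Each conditional is a simple Gaussian, but marginalizing over z
    # yields a multimodal action distribution. Sizes are placeholders.
    import torch
    import torch.nn as nn

    class LatentConditionedPolicy(nn.Module):
        def __init__(self, state_dim, action_dim, latent_dim, hidden=256):
            super().__init__()
            self.latent_dim = latent_dim
            self.net = nn.Sequential(
                nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.mu = nn.Linear(hidden, action_dim)       # per-latent mean
            self.log_std = nn.Linear(hidden, action_dim)  # per-latent log-std

        def sample_latent(self, batch_size=1):
            # Prior p(z); a single z is typically held fixed for a whole
            # trajectory so different z's can commit to different modes.
            return torch.randn(batch_size, self.latent_dim)

        def forward(self, state, z):
            h = self.net(torch.cat([state, z], dim=-1))
            std = self.log_std(h).clamp(-5, 2).exp()
            return torch.distributions.Normal(self.mu(h), std)

    # Usage: draw z once at the start of a rollout, then act with pi(a | s, z).
    policy = LatentConditionedPolicy(state_dim=8, action_dim=2, latent_dim=4)
    z = policy.sample_latent()
    action = policy(torch.zeros(1, 8), z).rsample()  # reparameterized sample

The reparameterized sample at the end is what gives the method its name: because actions are differentiable functions of noise, policy gradients can be propagated through a learned world model, as the abstract describes.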

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-huang23k,
  title     = {Reparameterized Policy Learning for Multimodal Trajectory Optimization},
  author    = {Huang, Zhiao and Liang, Litian and Ling, Zhan and Li, Xuanlin and Gan, Chuang and Su, Hao},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {13957--13975},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/huang23k/huang23k.pdf},
  url       = {https://proceedings.mlr.press/v202/huang23k.html}
}
APA
Huang, Z., Liang, L., Ling, Z., Li, X., Gan, C. & Su, H. (2023). Reparameterized Policy Learning for Multimodal Trajectory Optimization. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:13957-13975. Available from https://proceedings.mlr.press/v202/huang23k.html.
