Policy Optimization as Wasserstein Gradient Flows

Ruiyi Zhang, Changyou Chen, Chunyuan Li, Lawrence Carin
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5737-5746, 2018.

Abstract

Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize the parameters of a policy to maximize the expected total reward or a surrogate of it. Although such methods often achieve encouraging empirical success, their correspondence to optimization over policy distributions has remained mathematically unclear. We place policy optimization in the space of probability measures and interpret it as a Wasserstein gradient flow. In this probability-measure space, under specified circumstances, policy optimization becomes a convex problem over distributions. To make the optimization feasible, we develop efficient algorithms that numerically solve the corresponding discrete gradient flows. Our technique applies to several RL settings and is related to many state-of-the-art policy-optimization algorithms. Specifically, we define gradient flows on both the parameter-distribution space and the policy-distribution space, leading to what we term indirect-policy and direct-policy learning frameworks, respectively. Extensive experiments verify the effectiveness of our framework, which often obtains better performance than related algorithms.
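For context on "numerically solving the corresponding discrete gradient flows": the standard time discretization of a Wasserstein gradient flow is the Jordan-Kinderlehrer-Otto (JKO) scheme, in which each update is a Wasserstein-proximal step on an energy functional F defined over probability measures. The LaTeX sketch below states only this standard formulation; the particular choice of F for policy optimization (e.g., negative expected reward plus an entropy regularizer) is an illustrative assumption, not a restatement of the paper's algorithm.

% Continuous-time Wasserstein gradient flow of an energy functional F(\mu):
\[
  \partial_t \mu_t \;=\; \nabla \cdot \Big( \mu_t \, \nabla \tfrac{\delta F}{\delta \mu}(\mu_t) \Big).
\]
% JKO time discretization with step size h > 0: each iterate is a Wasserstein-proximal step,
\[
  \mu_{k+1} \;=\; \operatorname*{arg\,min}_{\mu \in \mathcal{P}_2(\mathbb{R}^d)}
  \Big\{ \tfrac{1}{2h}\, W_2^2(\mu, \mu_k) \;+\; F(\mu) \Big\},
  \qquad k = 0, 1, 2, \ldots
\]
% Illustrative (assumed) energy for direct-policy learning with policy \pi(\cdot \mid s):
% F(\pi) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}[\, Q(s,a) \,] - \alpha \, \mathcal{H}(\pi(\cdot \mid s)),
% which is linear plus strictly convex in \pi, one instance of the convexity noted above.

Here W_2 denotes the 2-Wasserstein distance, \mathcal{H} differential entropy, and h the step size; as h tends to 0, the JKO iterates approximate the continuous-time flow.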

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-zhang18a,
  title     = {Policy Optimization as {W}asserstein Gradient Flows},
  author    = {Zhang, Ruiyi and Chen, Changyou and Li, Chunyuan and Carin, Lawrence},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {5737--5746},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/zhang18a/zhang18a.pdf},
  url       = {https://proceedings.mlr.press/v80/zhang18a.html}
}
Endnote
%0 Conference Paper
%T Policy Optimization as Wasserstein Gradient Flows
%A Ruiyi Zhang
%A Changyou Chen
%A Chunyuan Li
%A Lawrence Carin
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-zhang18a
%I PMLR
%P 5737--5746
%U https://proceedings.mlr.press/v80/zhang18a.html
%V 80
APA
Zhang, R., Chen, C., Li, C. & Carin, L. (2018). Policy Optimization as Wasserstein Gradient Flows. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5737-5746. Available from https://proceedings.mlr.press/v80/zhang18a.html.