Wasserstein Policy Optimization

David Pfau, Ian Davies, Diana L Borsa, João Guilherme Madeira Araújo, Brendan Daniel Tracey, Hado Van Hasselt
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:49128-49149, 2025.

Abstract

We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions – without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
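To make the abstract's description concrete, the following is a minimal, hypothetical JAX sketch of the kind of actor update it describes: a differentiable critic supplies the action gradient of Q(s, a), which is combined with derivatives of log pi_theta(a|s) for a stochastic policy whose actions are sampled directly, without the reparameterization trick. The specific update form E[grad_theta grad_a log pi_theta(a|s) · grad_a Q(s, a)], the diagonal-Gaussian policy, and the names log_pi, q_fn, and wpo_style_direction are illustrative assumptions, not the paper's exact algorithm; see the paper for the actual WPO update and its derivation.

import jax
import jax.numpy as jnp

def log_pi(theta, s, a):
    # Hypothetical diagonal-Gaussian policy; theta = (W, b, log_std).
    W, b, log_std = theta
    mean = jnp.tanh(W @ s + b)
    return jnp.sum(-0.5 * ((a - mean) / jnp.exp(log_std)) ** 2 - log_std)

def q_fn(s, a):
    # Placeholder differentiable critic; in practice a learned Q-network.
    return -jnp.sum((a - 0.1 * s[: a.shape[0]]) ** 2)

def wpo_style_direction(theta, s, a):
    # grad_a Q(s, a): the action gradient of the critic, held fixed below.
    dq_da = jax.grad(q_fn, argnums=1)(s, a)
    # grad_a log pi_theta(a|s) as a function of theta.
    score_a = lambda th: jax.grad(log_pi, argnums=2)(th, s, a)
    # grad_theta of <grad_a log pi, grad_a Q>; a pytree with theta's structure.
    return jax.grad(lambda th: jnp.dot(score_a(th), dq_da))(theta)

# Single-sample usage: s from the environment, a ~ pi_theta(.|s) sampled directly.
key = jax.random.PRNGKey(0)
key, init_key, noise_key = jax.random.split(key, 3)
s_dim, a_dim = 4, 2
theta = (0.1 * jax.random.normal(init_key, (a_dim, s_dim)), jnp.zeros(a_dim), jnp.zeros(a_dim))
s = jnp.ones(s_dim)
a = jnp.tanh(theta[0] @ s + theta[1]) + jnp.exp(theta[2]) * jax.random.normal(noise_key, (a_dim,))
direction = wpo_style_direction(theta, s, a)

In practice such a direction would be averaged over a batch of sampled states and actions and fed to an optimizer; because the action gradient of Q appears explicitly, no reparameterized sampling path through the policy is needed.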

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-pfau25a,
  title     = {{W}asserstein Policy Optimization},
  author    = {Pfau, David and Davies, Ian and Borsa, Diana L and Ara\'{u}jo, Jo\~{a}o Guilherme Madeira and Tracey, Brendan Daniel and Van Hasselt, Hado},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {49128--49149},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/pfau25a/pfau25a.pdf},
  url       = {https://proceedings.mlr.press/v267/pfau25a.html},
  abstract  = {We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions – without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.}
}
Endnote
%0 Conference Paper
%T Wasserstein Policy Optimization
%A David Pfau
%A Ian Davies
%A Diana L Borsa
%A João Guilherme Madeira Araújo
%A Brendan Daniel Tracey
%A Hado Van Hasselt
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-pfau25a
%I PMLR
%P 49128--49149
%U https://proceedings.mlr.press/v267/pfau25a.html
%V 267
%X We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions – without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
APA
Pfau, D., Davies, I., Borsa, D.L., Araújo, J.G.M., Tracey, B.D. & Van Hasselt, H. (2025). Wasserstein Policy Optimization. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:49128-49149. Available from https://proceedings.mlr.press/v267/pfau25a.html.
