Convergence of Policy Mirror Descent Beyond Compatible Function Approximation

Uri Sherman, Tomer Koren, Yishay Mansour
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54825-54863, 2025.

Abstract

Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a generally weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.
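For context, the PMD template referred to in the abstract is usually written as the following proximal update; this is the standard form with a generic mirror map, not necessarily the exact formulation or notation used in the paper. For a step size $\eta > 0$ and a Bregman divergence $D_\phi$ induced by a mirror map $\phi$,

$$\pi_{t+1}(\cdot \mid s) \in \arg\max_{p \in \Delta(\mathcal{A})} \Big\{ \eta \, \big\langle Q^{\pi_t}(s,\cdot),\, p \big\rangle - D_\phi\big(p,\ \pi_t(\cdot \mid s)\big) \Big\} \qquad \text{for all } s \in \mathcal{S},$$

where $Q^{\pi_t}$ denotes the action-value function of the current policy $\pi_t$. With the negative-entropy mirror map, $D_\phi$ becomes the KL divergence and the update reduces to the familiar softmax / natural policy gradient step; the paper's contribution concerns how such updates behave when the maximization is restricted to a general (e.g., parametric) policy class rather than the full simplex.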

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-sherman25a,
  title     = {Convergence of Policy Mirror Descent Beyond Compatible Function Approximation},
  author    = {Sherman, Uri and Koren, Tomer and Mansour, Yishay},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {54825--54863},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/sherman25a/sherman25a.pdf},
  url       = {https://proceedings.mlr.press/v267/sherman25a.html},
  abstract  = {Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a generally weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.}
}
Endnote
%0 Conference Paper
%T Convergence of Policy Mirror Descent Beyond Compatible Function Approximation
%A Uri Sherman
%A Tomer Koren
%A Yishay Mansour
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-sherman25a
%I PMLR
%P 54825--54863
%U https://proceedings.mlr.press/v267/sherman25a.html
%V 267
%X Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a generally weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.
APA
Sherman, U., Koren, T. & Mansour, Y. (2025). Convergence of Policy Mirror Descent Beyond Compatible Function Approximation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54825-54863. Available from https://proceedings.mlr.press/v267/sherman25a.html.
