Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity

Alessandro Montenegro; Marco Mussi; Matteo Papini; Alberto Maria Metelli

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:44652-44698, 2025.

Abstract

Policy gradient (PG) methods are effective reinforcement learning (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order $\widetilde{\mathcal{O}}(\epsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning stochasticity (via an appropriate parameterization) and (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence with a rate $\widetilde{\mathcal{O}}(\epsilon^{-3})$, but to the optimal stochastic (hyper)policy only, requiring stronger assumptions compared to PES.
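The phase-based idea described in the abstract (hold the stochasticity fixed while a PG subroutine runs, then shrink it on a deterministic schedule) can be illustrated with a toy sketch. This is not the authors' PES algorithm: the one-dimensional quadratic reward, the Gaussian perturbation, the geometric decay of `sigma`, the baseline, and every function name and hyperparameter below are assumptions chosen only for illustration.

```python
import random


def pg_subroutine(theta, sigma, reward, steps, lr, batch, rng):
    """Run a score-function policy gradient loop with sigma held fixed.

    Assumed setup (hypothetical): actions are a = theta + sigma * eps
    with eps ~ N(0, 1), so d/dtheta log p(a) = eps / sigma.
    """
    for _ in range(steps):
        baseline = reward(theta)  # assumed variance-reduction baseline
        grad = 0.0
        for _ in range(batch):
            eps = rng.gauss(0.0, 1.0)
            action = theta + sigma * eps
            grad += (reward(action) - baseline) * eps / sigma
        theta += lr * grad / batch  # gradient ascent on expected reward
    return theta


def phased_stochasticity_sketch(theta0, sigma0, decay, phases, reward,
                                steps=50, lr=0.05, batch=32, seed=0):
    """Alternate fixed-sigma PG phases with a deterministic sigma schedule.

    The geometric schedule sigma <- decay * sigma is an assumption,
    not the schedule analyzed in the paper.
    """
    rng = random.Random(seed)
    theta, sigma = theta0, sigma0
    for _ in range(phases):
        theta = pg_subroutine(theta, sigma, reward, steps, lr, batch, rng)
        sigma *= decay  # reduce stochasticity only between phases
    return theta, sigma


def toy_reward(action):
    """Hypothetical reward, maximized by the deterministic action 2.0."""
    return -(action - 2.0) ** 2
```

Calling `phased_stochasticity_sketch(0.0, 1.0, 0.5, 5, toy_reward)` should move `theta` toward the maximizer of the toy reward while `sigma` shrinks after each phase, so the final policy is close to deterministic. The sketch says nothing about the sample-complexity rates quoted above.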

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-montenegro25a,
  title = 	 {Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity},
  author =       {Montenegro, Alessandro and Mussi, Marco and Papini, Matteo and Metelli, Alberto Maria},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {44652--44698},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/montenegro25a/montenegro25a.pdf},
  url = 	 {https://proceedings.mlr.press/v267/montenegro25a.html},
  abstract = 	 {Policy gradient (PG) methods are effective reinforcement learning (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order $\widetilde{\mathcal{O}}(\epsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning stochasticity (via an appropriate parameterization) and (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence with a rate $\widetilde{\mathcal{O}}(\epsilon^{-3})$, but to the optimal stochastic (hyper)policy only, requiring stronger assumptions compared to PES.}
}

Endnote

%0 Conference Paper
%T Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity
%A Alessandro Montenegro
%A Marco Mussi
%A Matteo Papini
%A Alberto Maria Metelli
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-montenegro25a
%I PMLR
%P 44652--44698
%U https://proceedings.mlr.press/v267/montenegro25a.html
%V 267
%X Policy gradient (PG) methods are effective reinforcement learning (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order $\widetilde{\mathcal{O}}(\epsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning stochasticity (via an appropriate parameterization) and (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence with a rate $\widetilde{\mathcal{O}}(\epsilon^{-3})$, but to the optimal stochastic (hyper)policy only, requiring stronger assumptions compared to PES.

APA

Montenegro, A., Mussi, M., Papini, M. & Metelli, A.M. (2025). Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:44652-44698. Available from https://proceedings.mlr.press/v267/montenegro25a.html.
