Order-Optimal Global Convergence for Actor-Critic with General Policy and Neural Critic Parametrization

Swetha Ganesh, Jiayu Chen, Washim Uddin Mondal, Vaneet Aggarwal
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1358-1380, 2025.

Abstract

This paper addresses the challenge of achieving order-optimal sample complexity in reinforcement learning for discounted Markov Decision Processes (MDPs) with general policy parameterization and multi-layer neural network critics. Existing approaches either fail to achieve the optimal rate or assume a linear critic. We introduce the Natural Actor-Critic with Data Drop (NAC-DD) algorithm, which integrates Natural Policy Gradient methods with a Data Drop technique to mitigate statistical dependencies inherent in Markovian sampling. NAC-DD achieves an optimal sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^2)$, marking a significant improvement over the previous state-of-the-art guarantee of $\tilde{\mathcal{O}}(1/\epsilon^3)$. The algorithm employs a multi-layer neural network critic with differentiable activation functions, aligning with real-world applications where tabular policies and linear critics are insufficient. Our work is the first to achieve order-optimal sample complexity for actor-critic methods with neural function approximation, continuous state and action spaces, and Markovian sampling. Empirical evaluations on benchmark tasks confirm the theoretical findings, demonstrating the practical efficacy of the proposed method.
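
For intuition, the following minimal Python sketch illustrates the generic idea behind a data-drop step: subsampling a Markovian trajectory so that the retained transitions are approximately independent when the gap between kept samples is on the order of the chain's mixing time. The toy environment, function names, and gap value are hypothetical illustrations and are not taken from the paper; this is not the NAC-DD algorithm itself.

# Hypothetical sketch of the generic "data drop" idea (subsampling a Markovian
# trajectory so the kept samples are roughly independent). Illustration only;
# not the NAC-DD algorithm from the paper.
import numpy as np

rng = np.random.default_rng(0)

def toy_step(state, action):
    """Toy 2-state Markov chain used purely for illustration."""
    next_state = (state + action) % 2 if rng.random() < 0.9 else state
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

def collect_trajectory(state, length):
    """Roll out (state, action, reward, next_state) tuples from the toy chain."""
    samples = []
    for _ in range(length):
        action = int(rng.integers(2))            # placeholder behavior policy
        next_state, reward = toy_step(state, action)
        samples.append((state, action, reward, next_state))
        state = next_state
    return samples

def data_drop(samples, gap):
    """Keep every `gap`-th transition and discard the rest, so the retained
    transitions are approximately independent when `gap` is comparable to the
    chain's mixing time."""
    return samples[::gap]

trajectory = collect_trajectory(state=0, length=1000)
kept = data_drop(trajectory, gap=20)             # ~50 nearly independent samples

The retained samples could then feed critic (TD) and natural-gradient actor updates; dropping the intermediate, highly correlated transitions is what lets concentration arguments for i.i.d.-style sampling go through under Markovian data.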

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-ganesh25a,
  title     = {Order-Optimal Global Convergence for Actor-Critic with General Policy and Neural Critic Parametrization},
  author    = {Ganesh, Swetha and Chen, Jiayu and Mondal, Washim Uddin and Aggarwal, Vaneet},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages     = {1358--1380},
  year      = {2025},
  editor    = {Chiappa, Silvia and Magliacane, Sara},
  volume    = {286},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--25 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/ganesh25a/ganesh25a.pdf},
  url       = {https://proceedings.mlr.press/v286/ganesh25a.html},
  abstract  = {This paper addresses the challenge of achieving order-optimal sample complexity in reinforcement learning for discounted Markov Decision Processes (MDPs) with general policy parameterization and multi-layer neural network critics. Existing approaches either fail to achieve the optimal rate or assume a linear critic. We introduce Natural Actor-Critic with Data Drop (NAC-DD) algorithm, which integrates Natural Policy Gradient methods with a Data Drop technique to mitigate statistical dependencies inherent in Markovian sampling. NAC-DD achieves an optimal sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^2)$, marking a significant improvement over the previous state-of-the-art guarantee of $\tilde{O}(1/\epsilon^3)$. The algorithm employs a multi-layer neural network critic with differentiable activation functions, aligning with real-world applications where tabular policies and linear critics are insufficient. Our work represents the first to achieve order-optimal sample complexity for actor-critic methods with neural function approximation, continuous state and action spaces, and Markovian sampling. Empirical evaluations on benchmark tasks confirm the theoretical findings, demonstrating the practical efficacy of the proposed method.}
}
Endnote
%0 Conference Paper
%T Order-Optimal Global Convergence for Actor-Critic with General Policy and Neural Critic Parametrization
%A Swetha Ganesh
%A Jiayu Chen
%A Washim Uddin Mondal
%A Vaneet Aggarwal
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-ganesh25a
%I PMLR
%P 1358--1380
%U https://proceedings.mlr.press/v286/ganesh25a.html
%V 286
%X This paper addresses the challenge of achieving order-optimal sample complexity in reinforcement learning for discounted Markov Decision Processes (MDPs) with general policy parameterization and multi-layer neural network critics. Existing approaches either fail to achieve the optimal rate or assume a linear critic. We introduce Natural Actor-Critic with Data Drop (NAC-DD) algorithm, which integrates Natural Policy Gradient methods with a Data Drop technique to mitigate statistical dependencies inherent in Markovian sampling. NAC-DD achieves an optimal sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^2)$, marking a significant improvement over the previous state-of-the-art guarantee of $\tilde{O}(1/\epsilon^3)$. The algorithm employs a multi-layer neural network critic with differentiable activation functions, aligning with real-world applications where tabular policies and linear critics are insufficient. Our work represents the first to achieve order-optimal sample complexity for actor-critic methods with neural function approximation, continuous state and action spaces, and Markovian sampling. Empirical evaluations on benchmark tasks confirm the theoretical findings, demonstrating the practical efficacy of the proposed method.
APA
Ganesh, S., Chen, J., Mondal, W.U. & Aggarwal, V. (2025). Order-Optimal Global Convergence for Actor-Critic with General Policy and Neural Critic Parametrization. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:1358-1380. Available from https://proceedings.mlr.press/v286/ganesh25a.html.