Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs

Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:3421-3429, 2025.

Abstract

We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Processes (MDPs). The first one employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a result that earlier works had assumed.
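
To make the variance-reduction mechanism concrete, below is a minimal sketch (in Python, not taken from the paper) of a policy-gradient loop using Implicit Gradient Transport. The stochastic gradient oracle grad_est, the step size, and the averaging weight gamma_t = 1/t are illustrative assumptions; the paper's actual estimator and constants are not reproduced here.

    import numpy as np


    def igt_policy_gradient(grad_est, theta0, num_iters, step_size, rng):
        """Sketch of policy-gradient ascent with Implicit Gradient Transport (IGT).

        grad_est(theta, rng) is a hypothetical stochastic oracle returning an
        estimate of the average-reward gradient at theta (e.g. a REINFORCE-style
        estimate from one trajectory); it stands in for the paper's estimator.
        """
        theta = np.asarray(theta0, dtype=float)
        theta_prev = theta.copy()
        v = np.zeros_like(theta)  # running (transported) gradient average

        for t in range(1, num_iters + 1):
            gamma = 1.0 / t  # averaging weight; at t = 1 the extrapolation vanishes
            # Query the gradient at an extrapolated point so that the old average v
            # stays approximately valid at the new iterate without explicit Hessian
            # corrections -- this is the "gradient transport" step.
            extrap = theta + ((1.0 - gamma) / gamma) * (theta - theta_prev)
            g = grad_est(extrap, rng)
            v = (1.0 - gamma) * v + gamma * g

            theta_prev = theta.copy()
            theta = theta + step_size * v  # ascent: we maximize average reward
        return theta


    if __name__ == "__main__":
        # Toy stand-in objective: a noisy concave quadratic playing the role of the
        # average-reward function, just to show that the loop runs.
        rng = np.random.default_rng(0)
        noisy_grad = lambda th, r: -2.0 * th + r.normal(scale=1.0, size=th.shape)
        print(igt_policy_gradient(noisy_grad, np.ones(3), 5000, 0.05, rng))

The key design choice is querying the gradient at an extrapolated point: this keeps the running average of past gradients approximately valid at the current iterate without explicit second-order corrections, which is what lets the averaging act as variance reduction.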

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-ganesh25a,
  title     = {Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs},
  author    = {Ganesh, Swetha and Mondal, Washim Uddin and Aggarwal, Vaneet},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {3421--3429},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/ganesh25a/ganesh25a.pdf},
  url       = {https://proceedings.mlr.press/v258/ganesh25a.html},
  abstract  = {We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Process (MDP). The first one employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a result that was previously assumed in earlier works.}
}
Endnote
%0 Conference Paper
%T Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs
%A Swetha Ganesh
%A Washim Uddin Mondal
%A Vaneet Aggarwal
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-ganesh25a
%I PMLR
%P 3421--3429
%U https://proceedings.mlr.press/v258/ganesh25a.html
%V 258
%X We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Process (MDP). The first one employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a result that was previously assumed in earlier works.
APA
Ganesh, S., Mondal, W.U. & Aggarwal, V. (2025). Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:3421-3429. Available from https://proceedings.mlr.press/v258/ganesh25a.html.
