Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

Bhrij Patel, Wesley A Suttle, Alec Koppel, Vaneet Aggarwal, Brian M. Sadler, Dinesh Manocha, Amrit Bedi
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:39889-39907, 2024.

Abstract

In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of how long a Markov chain under a fixed policy needs to approach its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic because mixing times are difficult and expensive to estimate in environments with large state spaces, so effective gradient estimation would require impractically long trajectories in practice. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. Our approach effectively alleviates the dependence on mixing-time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest dependence on the mixing time available from prior work, $\mathcal{O}(\sqrt{\tau_{mix}})$. With a 2D grid-world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy-gradient-based method for the average-reward setting.
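For intuition, the MLMC gradient estimator at the core of MAC can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the names `mlmc_policy_gradient`, `sample_grads`, and `T_max` are assumptions introduced here, and the construction follows the standard truncated MLMC estimator used in this line of work (a geometrically distributed rollout length with a level correction).

```python
import numpy as np

def mlmc_policy_gradient(sample_grads, T_max, rng=None):
    """Illustrative Multi-level Monte-Carlo (MLMC) gradient estimate.

    sample_grads(n) is a placeholder: it should roll out the current policy
    for n steps and return an (n, d) array of per-step policy-gradient
    samples. T_max only caps the rollout length; no mixing-time oracle is
    needed to choose a batch size.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Draw the level J ~ Geometric(1/2): rollouts of length 2^J are
    # exponentially rare in J.
    J = rng.geometric(0.5)
    T = 2 ** J
    if T > T_max:
        # Sampled level exceeds the cap: fall back to the base level.
        return sample_grads(1)[0]
    grads = sample_grads(T)                  # one rollout of 2^J steps
    g0 = grads[0]                            # base level (single sample)
    gJ = grads.mean(axis=0)                  # mean of the first 2^J samples
    gJm1 = grads[: T // 2].mean(axis=0)      # mean of the first 2^(J-1) samples
    # The 2^J weight offsets the 2^(-J) chance of reaching level J, so the
    # telescoping correction in expectation recovers the average over a
    # T_max-length rollout while drawing only O(log T_max) samples per call.
    return g0 + T * (gJ - gJm1)
```

Under this construction the expected rollout length per update is only logarithmic in `T_max`, which is how the estimator sidesteps the need to know the mixing time up front when choosing trajectory lengths.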

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-patel24b,
  title     = {Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles},
  author    = {Patel, Bhrij and Suttle, Wesley A and Koppel, Alec and Aggarwal, Vaneet and Sadler, Brian M. and Manocha, Dinesh and Bedi, Amrit},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {39889--39907},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/patel24b/patel24b.pdf},
  url       = {https://proceedings.mlr.press/v235/patel24b.html},
  abstract  = {In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of how long a Markov chain under a fixed policy needs to approach its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic because mixing times are difficult and expensive to estimate in environments with large state spaces, so effective gradient estimation would require impractically long trajectories in practice. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. Our approach effectively alleviates the dependence on mixing-time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest dependence on the mixing time available from prior work, $\mathcal{O}(\sqrt{\tau_{mix}})$. With a 2D grid-world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy-gradient-based method for the average-reward setting.}
}
Endnote
%0 Conference Paper
%T Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles
%A Bhrij Patel
%A Wesley A Suttle
%A Alec Koppel
%A Vaneet Aggarwal
%A Brian M. Sadler
%A Dinesh Manocha
%A Amrit Bedi
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-patel24b
%I PMLR
%P 39889--39907
%U https://proceedings.mlr.press/v235/patel24b.html
%V 235
%X In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of how long a Markov chain under a fixed policy needs to approach its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic because mixing times are difficult and expensive to estimate in environments with large state spaces, so effective gradient estimation would require impractically long trajectories in practice. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. Our approach effectively alleviates the dependence on mixing-time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest dependence on the mixing time available from prior work, $\mathcal{O}(\sqrt{\tau_{mix}})$. With a 2D grid-world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy-gradient-based method for the average-reward setting.
APA
Patel, B., Suttle, W.A., Koppel, A., Aggarwal, V., Sadler, B.M., Manocha, D. & Bedi, A. (2024). Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:39889-39907. Available from https://proceedings.mlr.press/v235/patel24b.html.
