A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning

Mizhaan P. Maniyar, Prashanth L.A., Akash Mondal, Shalabh Bhatnagar
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:4708-4716, 2024.

Abstract

We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.
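For intuition, a cubic-regularized Newton update minimizes a quadratic model of the objective plus a cubic penalty, and the subproblem can itself be solved approximately by gradient descent, as in the paper's second algorithm. The sketch below is a minimal illustration only, assuming a minimization objective (e.g., the negated value function), generic grad_est/hess_est callables standing in for the paper's likelihood-ratio gradient and Hessian estimators, and illustrative constants M, lr, and n_steps; it is not the authors' exact algorithm.

import numpy as np

def cubic_subproblem_gd(g, H, M, n_steps=200, lr=0.01):
    # Approximately solve  min_s  g^T s + 0.5 s^T H s + (M/6) ||s||^3
    # by plain gradient descent (a stand-in for the paper's inexact
    # subproblem solver; step sizes and iteration counts are illustrative).
    s = np.zeros_like(g)
    for _ in range(n_steps):
        # Gradient of the cubic model: g + H s + (M/2) ||s|| s
        grad = g + H @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * grad
    return s

def cubic_newton_step(theta, grad_est, hess_est, M=10.0):
    # One cubic-regularized Newton update on the policy parameters theta.
    # grad_est / hess_est would be formed from sampled trajectories via
    # likelihood ratios; here they are treated as given callables.
    g = grad_est(theta)
    H = hess_est(theta)
    s = cubic_subproblem_gd(g, H, M)
    return theta + s

In the paper's setting, g and H are estimated from sample trajectories, and the inner gradient-descent loop plays the role of the efficient approximate solver for the cubic-regularized subproblem.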

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-maniyar24a,
  title     = {A Cubic-regularized Policy {N}ewton Algorithm for Reinforcement Learning},
  author    = {Maniyar, Mizhaan P. and L.A., Prashanth and Mondal, Akash and Bhatnagar, Shalabh},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {4708--4716},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/maniyar24a/maniyar24a.pdf},
  url       = {https://proceedings.mlr.press/v238/maniyar24a.html},
  abstract  = {We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.}
}
Endnote
%0 Conference Paper
%T A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning
%A Mizhaan P. Maniyar
%A Prashanth L.A.
%A Akash Mondal
%A Shalabh Bhatnagar
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-maniyar24a
%I PMLR
%P 4708--4716
%U https://proceedings.mlr.press/v238/maniyar24a.html
%V 238
%X We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.
APA
Maniyar, M.P., L.A., P., Mondal, A. & Bhatnagar, S. (2024). A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:4708-4716. Available from https://proceedings.mlr.press/v238/maniyar24a.html.