Scaling Limits of Deep Reinforcement Learning: A Stability Analysis with Maximal Update Parametrization

Majid Ghasemi, Mark Crowley
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:962-967, 2026.

Abstract

While scaling laws have revolutionized supervised learning, their implications for Deep Reinforcement Learning remain under-explored. This paper investigates the theoretical and practical scaling limits of Deep Q-Networks by controlling network parameterization across varying widths. Our empirical results on CartPole-v1 demonstrate that: (1) the standard Feature Learning regime (Mean-Field Theory, $\alpha=1$) achieves the highest peak performance (Return $79.6$) but suffers from catastrophic divergence and rank collapse at large widths; (2) the Lazy Training regime (NTK, $\alpha=0$) is performant (Return $72.1$) but numerically ill-conditioned; and (3) Maximal Update Parametrization ($\mu P$, $\alpha=0.5$) acts as a robust stabilizer, preventing divergence and rank collapse across the entire hyperparameter spectrum, albeit with more conservative learning dynamics (Return $49.7$). These findings suggest that while feature learning is necessary for optimal control, naively scaling width without controlling update dynamics leads to optimization instability.

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-ghasemi26a,
  title = {Scaling Limits of Deep Reinforcement Learning: A Stability Analysis with Maximal Update Parametrization},
  author = {Ghasemi, Majid and Crowley, Mark},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {962--967},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/ghasemi26a/ghasemi26a.pdf},
  url = {https://proceedings.mlr.press/v318/ghasemi26a.html},
  abstract = {While scaling laws have revolutionized supervised learning, their implications for Deep Reinforcement Learning remain under-explored. This paper investigates the theoretical and practical scaling limits of Deep Q-Networks by controlling network parameterization across varying widths. Our empirical results on CartPole-v1 demonstrate that: (1) The standard Feature Learning regime (Mean-Field Theory, $\alpha=1$) achieves the highest peak performance (Return $79.6$) but suffers from catastrophic divergence and rank collapse at large widths; (2) The Lazy Training regime (NTK, $\alpha=0$) is performant (Return $72.1$) but numerically ill-conditioned; and (3) Maximal Update Parametrization ($\mu P$, $\alpha=0.5$) acts as a robust stabilizer, preventing divergence and rank collapse across the entire hyperparameter spectrum, albeit with more conservative learning dynamics (Return 49.7). These findings suggest that while feature learning is necessary for optimal control, naively scaling width without controlling update dynamics leads to optimization instability.}
}
Endnote
%0 Conference Paper
%T Scaling Limits of Deep Reinforcement Learning: A Stability Analysis with Maximal Update Parametrization
%A Majid Ghasemi
%A Mark Crowley
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-ghasemi26a
%I PMLR
%P 962--967
%U https://proceedings.mlr.press/v318/ghasemi26a.html
%V 318
%X While scaling laws have revolutionized supervised learning, their implications for Deep Reinforcement Learning remain under-explored. This paper investigates the theoretical and practical scaling limits of Deep Q-Networks by controlling network parameterization across varying widths. Our empirical results on CartPole-v1 demonstrate that: (1) The standard Feature Learning regime (Mean-Field Theory, $\alpha=1$) achieves the highest peak performance (Return $79.6$) but suffers from catastrophic divergence and rank collapse at large widths; (2) The Lazy Training regime (NTK, $\alpha=0$) is performant (Return $72.1$) but numerically ill-conditioned; and (3) Maximal Update Parametrization ($\mu P$, $\alpha=0.5$) acts as a robust stabilizer, preventing divergence and rank collapse across the entire hyperparameter spectrum, albeit with more conservative learning dynamics (Return 49.7). These findings suggest that while feature learning is necessary for optimal control, naively scaling width without controlling update dynamics leads to optimization instability.
APA
Ghasemi, M. & Crowley, M. (2026). Scaling Limits of Deep Reinforcement Learning: A Stability Analysis with Maximal Update Parametrization. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:962-967. Available from https://proceedings.mlr.press/v318/ghasemi26a.html.
