On the Provable Separation of Scales in Maximal Update Parameterization

Letong Hong, Zhangyang Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23774-23786, 2025.

Abstract

Maximal Update Parameterization ($\mu$P) has shown significant promise in allowing zero-shot hyperparameter transfer across neural network scales, reducing the prohibitive cost of hyperparameter tuning for large models. However, the theoretical foundation behind the observed approximate transferability of hyperparameters remains underexplored. Relying on a width-dominance regime, which ensures that as width grows, certain terms of the learning dynamics dominate, we establish the first fundamental separation of scales in $\mu$P between macro-variables (e.g. loss landscapes) and micro-variables (e.g. individual weights). Our formulation explains why hyperparameter tuning can be effectively performed in early training stages, i.e., early statistics effectively approximate global hyperparameter optima, implying the potential to further reduce the training costs required for searching optimal hyperparameters. We further apply our main theory to explain an empirical deep learning phenomenon discovered independently by prior work.
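
To ground the terminology, here is a minimal sketch (not taken from the paper) of the standard $\mu$P width-scaling rules for a small MLP trained with Adam in PyTorch; the helper make_mup_mlp, the base rate lr0, and the widths 256 and 4096 are illustrative assumptions rather than the paper's setup. The practical content of zero-shot transfer is that lr0, tuned once at a small width, is reused unchanged at larger widths, with only the width-dependent per-layer multipliers changing.

import torch
import torch.nn as nn

def make_mup_mlp(d_in, n, d_out, lr0=1e-3):
    # Input, hidden, and readout layers; biases are dropped on the
    # wide layers to keep the sketch short.
    f_in  = nn.Linear(d_in, n)
    f_hid = nn.Linear(n, n, bias=False)
    f_out = nn.Linear(n, d_out, bias=False)

    # muP initialization: input and hidden weights have variance ~ 1/fan_in;
    # the readout is scaled down further (variance ~ 1/n^2).
    nn.init.normal_(f_in.weight,  std=d_in ** -0.5)
    nn.init.normal_(f_hid.weight, std=n ** -0.5)
    nn.init.normal_(f_out.weight, std=1.0 / n)

    model = nn.Sequential(f_in, nn.ReLU(), f_hid, nn.ReLU(), f_out)

    # muP learning rates under Adam: width-independent for the input layer,
    # scaled by 1/n for hidden and readout weights.
    opt = torch.optim.Adam([
        {"params": f_in.parameters(),  "lr": lr0},
        {"params": f_hid.parameters(), "lr": lr0 / n},
        {"params": f_out.parameters(), "lr": lr0 / n},
    ])
    return model, opt

# The same base rate lr0 (tuned once at width 256) is transferred to width 4096.
model_small, opt_small = make_mup_mlp(32, 256, 10)
model_large, opt_large = make_mup_mlp(32, 4096, 10)

Under such a parameterization, the optimal lr0 is observed empirically to be approximately width-independent; the abstract's separation between macro-variables and micro-variables is aimed at the theoretical basis of this transfer behavior.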

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hong25h,
  title     = {On the Provable Separation of Scales in Maximal Update Parameterization},
  author    = {Hong, Letong and Wang, Zhangyang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {23774--23786},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hong25h/hong25h.pdf},
  url       = {https://proceedings.mlr.press/v267/hong25h.html}
}
APA
Hong, L. & Wang, Z. (2025). On the Provable Separation of Scales in Maximal Update Parameterization. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:23774-23786. Available from https://proceedings.mlr.press/v267/hong25h.html.
