Scaling Exponents Across Parameterizations and Optimizers

Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, Jeffrey Pennington
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:12666-12700, 2024.

Abstract

Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 27B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
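As a rough illustration of the Adam-atan2 idea mentioned in the abstract, the sketch below shows why replacing Adam's division by sqrt(v) plus epsilon with an atan2 call gives an update that is scale-invariant and defined even for vanishingly small gradients, so no epsilon hyperparameter is needed. This is a minimal sketch of the principle only; the function name is invented here and the paper's exact update (including any scaling constants) may differ.

import math

def adam_atan2_update(m_hat, v_hat, lr):
    # Standard Adam computes m_hat / (sqrt(v_hat) + eps); the atan2 form below
    # replaces the division (and eps) with a bounded, scale-invariant angle:
    # atan2(c*m, c*sqrt(v)) == atan2(m, sqrt(v)) for any c > 0, and the result
    # is well-defined even when both arguments are zero.
    return -lr * math.atan2(m_hat, math.sqrt(v_hat))

# Illustrative moment estimates small enough that a fixed eps would dominate
# (or the gradient would underflow); the atan2 update is unaffected.
step = adam_atan2_update(m_hat=1e-12, v_hat=1e-24, lr=1e-3)
print(step)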

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-everett24a,
  title     = {Scaling Exponents Across Parameterizations and Optimizers},
  author    = {Everett, Katie E and Xiao, Lechao and Wortsman, Mitchell and Alemi, Alexander A and Novak, Roman and Liu, Peter J and Gur, Izzeddin and Sohl-Dickstein, Jascha and Kaelbling, Leslie Pack and Lee, Jaehoon and Pennington, Jeffrey},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12666--12700},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/everett24a/everett24a.pdf},
  url       = {https://proceedings.mlr.press/v235/everett24a.html},
  abstract  = {Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 27B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.}
}
Endnote
%0 Conference Paper
%T Scaling Exponents Across Parameterizations and Optimizers
%A Katie E Everett
%A Lechao Xiao
%A Mitchell Wortsman
%A Alexander A Alemi
%A Roman Novak
%A Peter J Liu
%A Izzeddin Gur
%A Jascha Sohl-Dickstein
%A Leslie Pack Kaelbling
%A Jaehoon Lee
%A Jeffrey Pennington
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-everett24a
%I PMLR
%P 12666--12700
%U https://proceedings.mlr.press/v235/everett24a.html
%V 235
%X Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 27B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
APA
Everett, K.E., Xiao, L., Wortsman, M., Alemi, A.A., Novak, R., Liu, P.J., Gur, I., Sohl-Dickstein, J., Kaelbling, L.P., Lee, J. & Pennington, J. (2024). Scaling Exponents Across Parameterizations and Optimizers. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:12666-12700. Available from https://proceedings.mlr.press/v235/everett24a.html.