On the Weight Dynamics of Deep Normalized Networks

Christian H.X. Ali Mehmeti-Göpel, Michael Wand
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:992-1007, 2024.

Abstract

Recent studies have shown that high disparities in effective learning rates (ELRs) across layers in deep neural networks can negatively affect trainability. We formalize how these disparities evolve over time by modeling the weight dynamics (the evolution of expected gradient and weight norms) of networks with normalization layers, predicting the evolution of layer-wise ELR ratios. We prove that when training with any constant learning rate, ELR ratios converge to 1, despite initial gradient explosion. We identify a "critical learning rate", depending only on the current ELRs, beyond which ELR disparities widen. To validate our findings, we devise a hyper-parameter-free warm-up method that minimizes ELR spread quickly in theory and practice. Our experiments link ELR spread with trainability, a relationship that is most evident in very deep networks with significant gradient-magnitude excursions.
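
To make the quantities above concrete, the following is a minimal sketch (not the authors' code) of how layer-wise ELRs and their spread could be monitored during training. It assumes the common approximation ELR_l ≈ lr * ||grad_l|| / ||w_l|| (relative update size), which may differ from the paper's formal definition; the helper names layerwise_elr and elr_spread are hypothetical.

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def layerwise_elr(model: nn.Module, lr: float) -> dict:
        """Approximate per-layer ELRs from current gradient and weight norms.

        Assumes ELR_l ~ lr * ||grad_l|| / ||w_l||, a proxy for the relative
        update size of each weight matrix; this is an assumption, not the
        paper's formal definition.
        """
        elrs = {}
        for name, p in model.named_parameters():
            if p.grad is None or p.dim() < 2:  # skip biases and norm-layer scale/shift
                continue
            w_norm = p.norm()
            g_norm = p.grad.norm()
            if w_norm > 0:
                elrs[name] = (lr * g_norm / w_norm).item()
        return elrs

    def elr_spread(elrs: dict) -> float:
        """Max/min ratio of layer-wise ELRs; 1.0 means perfectly balanced layers."""
        vals = list(elrs.values())
        return max(vals) / min(vals) if vals else float("nan")

    # Usage (hypothetical training loop): call after loss.backward(),
    # before optimizer.step():
    #   elrs = layerwise_elr(model, lr=0.1)
    #   print("ELR spread (max/min):", elr_spread(elrs))

A spread close to 1 corresponds to the balanced regime the paper associates with good trainability; large spreads indicate the ELR disparities it links to training difficulties in very deep networks.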

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-ali-mehmeti-gopel24a,
  title     = {On the Weight Dynamics of Deep Normalized Networks},
  author    = {Ali Mehmeti-G\"{o}pel, Christian H.X. and Wand, Michael},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {992--1007},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/ali-mehmeti-gopel24a/ali-mehmeti-gopel24a.pdf},
  url       = {https://proceedings.mlr.press/v235/ali-mehmeti-gopel24a.html}
}
Endnote
%0 Conference Paper
%T On the Weight Dynamics of Deep Normalized Networks
%A Christian H.X. Ali Mehmeti-Göpel
%A Michael Wand
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-ali-mehmeti-gopel24a
%I PMLR
%P 992--1007
%U https://proceedings.mlr.press/v235/ali-mehmeti-gopel24a.html
%V 235
APA
Ali Mehmeti-Göpel, C.H. & Wand, M. (2024). On the Weight Dynamics of Deep Normalized Networks. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:992-1007. Available from https://proceedings.mlr.press/v235/ali-mehmeti-gopel24a.html.