SGD with Weight Decay Secretly Minimizes the Ranks of Your Neural Networks

Tomer Galanti, Zachary S Siegel, Aparna Gupte, Tomaso A Poggio
Conference on Parsimony and Learning, PMLR 280:1388-1412, 2025.

Abstract

We explore the implicit bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Through theoretical analysis and empirical validation, we demonstrate that this rank-minimizing bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Unlike previous studies, our analysis does not rely on restrictive assumptions about the data, convergence, optimality of the learned weight matrices, or the network architecture, making it applicable to a wide range of neural network architectures of any width or depth. We further show that weight decay is essential for inducing this low-rank bias. Finally, we empirically explore the connection between this bias and generalization, finding that it has a noticeable, yet marginal, effect on test performance.
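As an illustrative sketch (not the paper's experimental setup), the claimed bias can be probed by training a small network with SGD and weight decay and then reporting the numerical rank of each weight matrix; the architecture, hyperparameters, synthetic data, and rank threshold below are arbitrary choices for illustration only.

# Illustrative sketch, not from the paper: train a small MLP with SGD + weight
# decay on synthetic data, then report the numerical rank of each weight matrix.
# All hyperparameters and the rank threshold are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 32)              # synthetic inputs
y = torch.randint(0, 10, (512,))      # synthetic labels

model = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Small batch size, relatively large learning rate, and nonzero weight decay:
# the regime in which the abstract says the low-rank bias is most pronounced.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-3)
loss_fn = nn.CrossEntropyLoss()

def numerical_rank(W, eps=1e-3):
    """Count singular values above eps times the largest singular value."""
    s = torch.linalg.svdvals(W)       # singular values in descending order
    return int((s > eps * s[0]).sum())

for step in range(2000):
    idx = torch.randint(0, X.shape[0], (8,))   # mini-batch of size 8
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()

for name, p in model.named_parameters():
    if p.ndim == 2:                   # weight matrices only, skip biases
        print(f"{name}: shape {tuple(p.shape)}, numerical rank {numerical_rank(p.detach())}")

Rerunning this sketch with the weight decay set to zero, a larger batch size, or a smaller learning rate gives a simple way to see how each hyperparameter affects the reported ranks.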

Cite this Paper


BibTeX
@InProceedings{pmlr-v280-galanti25a,
  title     = {SGD with Weight Decay Secretly Minimizes the Ranks of Your Neural Networks},
  author    = {Galanti, Tomer and Siegel, Zachary S and Gupte, Aparna and Poggio, Tomaso A},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {1388--1412},
  year      = {2025},
  editor    = {Chen, Beidi and Liu, Shijia and Pilanci, Mert and Su, Weijie and Sulam, Jeremias and Wang, Yuxiang and Zhu, Zhihui},
  volume    = {280},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v280/main/assets/galanti25a/galanti25a.pdf},
  url       = {https://proceedings.mlr.press/v280/galanti25a.html},
  abstract  = {We explore the implicit bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Through theoretical analysis and empirical validation, we demonstrate that this rank-minimizing bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Unlike previous studies, our analysis does not rely on restrictive assumptions about the data, convergence, optimality of the learned weight matrices, or the network architecture, making it applicable to a wide range of neural network architectures of any width or depth. We further show that weight decay is essential for inducing this low-rank bias. Finally, we empirically explore the connection between this bias and generalization, finding that it has a noticeable, yet marginal, effect on test performance.}
}
Endnote
%0 Conference Paper
%T SGD with Weight Decay Secretly Minimizes the Ranks of Your Neural Networks
%A Tomer Galanti
%A Zachary S Siegel
%A Aparna Gupte
%A Tomaso A Poggio
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Beidi Chen
%E Shijia Liu
%E Mert Pilanci
%E Weijie Su
%E Jeremias Sulam
%E Yuxiang Wang
%E Zhihui Zhu
%F pmlr-v280-galanti25a
%I PMLR
%P 1388--1412
%U https://proceedings.mlr.press/v280/galanti25a.html
%V 280
%X We explore the implicit bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Through theoretical analysis and empirical validation, we demonstrate that this rank-minimizing bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Unlike previous studies, our analysis does not rely on restrictive assumptions about the data, convergence, optimality of the learned weight matrices, or the network architecture, making it applicable to a wide range of neural network architectures of any width or depth. We further show that weight decay is essential for inducing this low-rank bias. Finally, we empirically explore the connection between this bias and generalization, finding that it has a noticeable, yet marginal, effect on test performance.
APA
Galanti, T., Siegel, Z. S., Gupte, A., & Poggio, T. A. (2025). SGD with Weight Decay Secretly Minimizes the Ranks of Your Neural Networks. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 280:1388-1412. Available from https://proceedings.mlr.press/v280/galanti25a.html.