SGD with Large Step Sizes Learns Sparse Features

Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, Nicolas Flammarion
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:903-925, 2023.

Abstract

We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) that this stabilization induces a hidden stochastic dynamics which implicitly biases the iterates toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used: the regularization effect comes solely from the SGD dynamics influenced by the large step size schedule. These observations thus unveil how, through the step size schedule, gradient and noise jointly drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. This analysis allows us to shed new light on some common practices and observed phenomena when training deep networks.
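As an illustration of the mechanism described above, here is a minimal sketch, not taken from the paper, of SGD on a toy over-parameterized sparse regression task with a diagonal parameterization w = u * v (in the spirit of the simple models the abstract alludes to). The problem sizes, initialization, step sizes, and the decay point are illustrative assumptions; printing the training loss and an approximate nonzero count lets one inspect how the large-step phase and the subsequent decay affect the learned representation.

# Illustrative sketch (not code from the paper): SGD with a large-then-decayed
# step size on a toy sparse regression task, using a diagonal parameterization
# w = u * v. Constants below are assumptions and may need tuning to reproduce
# the loss-stabilization and sparsity effects described in the abstract.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 40, 3                        # samples, dimension, true sparsity
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:k] = 1.0                           # sparse ground-truth weights
y = X @ w_star

u = np.full(d, 0.5)                        # diagonal-network parameters: w = u * v
v = np.full(d, 0.5)

def sgd_epoch(u, v, lr, batch=1):
    """One epoch of mini-batch SGD on the squared loss of the diagonal network."""
    idx = rng.permutation(n)
    for start in range(0, n, batch):
        b = idx[start:start + batch]
        r = X[b] @ (u * v) - y[b]          # residual on the mini-batch
        g = X[b].T @ r / len(b)            # gradient w.r.t. w = u * v
        u, v = u - lr * g * v, v - lr * g * u
    return u, v

for epoch in range(300):
    lr = 0.05 if epoch < 200 else 0.005    # large step sizes, then a decay
    u, v = sgd_epoch(u, v, lr)
    if epoch % 50 == 0 or epoch == 299:
        loss = 0.5 * np.mean((X @ (u * v) - y) ** 2)
        nnz = int(np.sum(np.abs(u * v) > 1e-2))
        print(f"epoch {epoch:3d}  lr {lr:.3f}  loss {loss:.4f}  ~nonzeros {nnz}")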

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-andriushchenko23b,
  title     = {{SGD} with Large Step Sizes Learns Sparse Features},
  author    = {Andriushchenko, Maksym and Varre, Aditya Vardhan and Pillaud-Vivien, Loucas and Flammarion, Nicolas},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {903--925},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/andriushchenko23b/andriushchenko23b.pdf},
  url       = {https://proceedings.mlr.press/v202/andriushchenko23b.html}
}
Endnote
%0 Conference Paper
%T SGD with Large Step Sizes Learns Sparse Features
%A Maksym Andriushchenko
%A Aditya Vardhan Varre
%A Loucas Pillaud-Vivien
%A Nicolas Flammarion
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-andriushchenko23b
%I PMLR
%P 903--925
%U https://proceedings.mlr.press/v202/andriushchenko23b.html
%V 202
APA
Andriushchenko, M., Varre, A.V., Pillaud-Vivien, L. & Flammarion, N. (2023). SGD with Large Step Sizes Learns Sparse Features. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:903-925. Available from https://proceedings.mlr.press/v202/andriushchenko23b.html.