Revisiting the Noise Model of Stochastic Gradient Descent

Barak Battash, Lior Wolf, Ofir Lindenbaum
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:4780-4788, 2024.

Abstract

The effectiveness of stochastic gradient descent (SGD) in neural network optimization is significantly influenced by stochastic gradient noise (SGN). Following the central limit theorem, SGN was initially described as Gaussian, but recently Simsekli et al. (2019) demonstrated that the $S\alpha S$ Lévy distribution provides a better fit for the SGN. This claim was later disputed, and subsequent work reverted to the previously proposed Gaussian noise model. This study provides robust, comprehensive empirical evidence that SGN is heavy-tailed and is better represented by the $S\alpha S$ distribution. Our experiments include several datasets and multiple models, both discriminative and generative. Furthermore, we argue that different network parameters preserve distinct SGN properties. We develop a novel framework based on a Lévy-driven stochastic differential equation (SDE), where one-dimensional Lévy processes describe each parameter. This leads to a more accurate characterization of the dynamics of SGD around local minima. We use our framework to study SGD properties near local minima; these include the mean escape time and preferable exit directions.
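
To make the Gaussian versus $S\alpha S$ contrast concrete, the following is a minimal sketch, not the authors' code. It draws standard symmetric alpha-stable samples with the Chambers-Mallows-Stuck method and compares how often they exceed a large threshold. The function names `sample_sas` and `tail_fraction`, the tail index 1.5, the threshold 10, and the sample size are all illustrative assumptions; none of them come from the paper. With `alpha = 2` the sampler reduces to a Gaussian with variance 2.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw standard symmetric alpha-stable samples (Chambers-Mallows-Stuck).

    Hypothetical helper for illustration; valid for 0 < alpha <= 2.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-rate exponential
    return (
        np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
        * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    )


def tail_fraction(x, threshold):
    """Fraction of samples whose magnitude exceeds the threshold."""
    return float(np.mean(np.abs(x) > threshold))


rng = np.random.default_rng(0)
heavy = sample_sas(1.5, 100_000, rng)  # heavy-tailed case (assumed alpha)
gauss = sample_sas(2.0, 100_000, rng)  # alpha = 2 is the Gaussian case

print("alpha=1.5 tail fraction:", tail_fraction(heavy, 10.0))
print("alpha=2.0 tail fraction:", tail_fraction(gauss, 10.0))
```

Under these assumptions the heavy-tailed samples exceed the threshold far more often than the Gaussian ones, which is the kind of difference a tail-index estimate on real gradient noise would need to detect.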

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-battash24a,
  title = {Revisiting the Noise Model of Stochastic Gradient Descent},
  author = {Battash, Barak and Wolf, Lior and Lindenbaum, Ofir},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages = {4780--4788},
  year = {2024},
  editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume = {238},
  series = {Proceedings of Machine Learning Research},
  month = {02--04 May},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v238/battash24a/battash24a.pdf},
  url = {https://proceedings.mlr.press/v238/battash24a.html},
  abstract = {The effectiveness of stochastic gradient descent (SGD) in neural network optimization is significantly influenced by stochastic gradient noise (SGN). Following the central limit theorem, SGN was initially described as Gaussian, but recently Simsekli et al (2019) demonstrated that the $S\alpha S$ Lévy distribution provides a better fit for the SGN. This assertion was purportedly debunked and rebounded to the Gaussian noise model that had been previously proposed. This study provides robust, comprehensive empirical evidence that SGN is heavy-tailed and is better represented by the $S\alpha S$ distribution. Our experiments include several datasets and multiple models, both discriminative and generative. Furthermore, we argue that different network parameters preserve distinct SGN properties. We develop a novel framework based on a Lévy-driven stochastic differential equation (SDE), where one-dimensional Lévy processes describe each parameter. This leads to a more accurate characterization of the dynamics of SGD around local minima. We use our framework to study SGD properties near local minima; these include the mean escape time and preferable exit directions.}
}
Endnote
%0 Conference Paper
%T Revisiting the Noise Model of Stochastic Gradient Descent
%A Barak Battash
%A Lior Wolf
%A Ofir Lindenbaum
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-battash24a
%I PMLR
%P 4780--4788
%U https://proceedings.mlr.press/v238/battash24a.html
%V 238
%X The effectiveness of stochastic gradient descent (SGD) in neural network optimization is significantly influenced by stochastic gradient noise (SGN). Following the central limit theorem, SGN was initially described as Gaussian, but recently Simsekli et al (2019) demonstrated that the $S\alpha S$ Lévy distribution provides a better fit for the SGN. This assertion was purportedly debunked and rebounded to the Gaussian noise model that had been previously proposed. This study provides robust, comprehensive empirical evidence that SGN is heavy-tailed and is better represented by the $S\alpha S$ distribution. Our experiments include several datasets and multiple models, both discriminative and generative. Furthermore, we argue that different network parameters preserve distinct SGN properties. We develop a novel framework based on a Lévy-driven stochastic differential equation (SDE), where one-dimensional Lévy processes describe each parameter. This leads to a more accurate characterization of the dynamics of SGD around local minima. We use our framework to study SGD properties near local minima; these include the mean escape time and preferable exit directions.
APA
Battash, B., Wolf, L. & Lindenbaum, O. (2024). Revisiting the Noise Model of Stochastic Gradient Descent. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:4780-4788. Available from https://proceedings.mlr.press/v238/battash24a.html.
