What is the Long-Run Distribution of Stochastic Gradient Descent? A Large Deviations Analysis

Waïss Azizian, Franck Iutzeler, Jerome Malick, Panayotis Mertikopoulos
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:2168-2229, 2024.

Abstract

In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem’s state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method’s step-size and energy levels determined by the problem’s objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem’s critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem’s minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally, (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.
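As a rough, self-contained illustration of the Boltzmann-Gibbs picture described in the abstract (not taken from the paper), the following Python sketch simulates constant step-size SGD on a toy double-well objective with Gaussian gradient noise and compares the empirical occupation measure of the iterates with a Gibbs density proportional to exp(-2 f(x) / (eta * sigma^2)). This exponent is the classical small step-size, Gaussian-noise special case; the paper's energy function is more general and also depends on the statistics of the noise. The objective, step-size, and noise scale below are illustrative choices, not values from the paper.

    # Toy illustration: empirical long-run distribution of constant step-size SGD
    # on a 1-D double-well objective vs. a Gibbs density exp(-2 f(x) / (eta * sigma^2)).
    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):          # asymmetric double well: the minimum near x = -1 is deeper
        return 0.25 * x**4 - 0.5 * x**2 + 0.1 * x

    def grad_f(x):
        return x**3 - x + 0.1

    eta = 0.1          # step-size, playing the role of the temperature
    sigma = 1.0        # scale of the Gaussian gradient noise
    n_iters = 2_000_000

    x = 1.0
    samples = np.empty(n_iters)
    for t in range(n_iters):
        # SGD step with noisy gradient: x_{t+1} = x_t - eta * (grad f(x_t) + noise)
        x = x - eta * (grad_f(x) + sigma * rng.standard_normal())
        samples[t] = x

    # Empirical occupation measure of the iterates
    hist, edges = np.histogram(samples, bins=200, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    bin_w = centers[1] - centers[0]

    # Gibbs density with energy 2 f(x) / sigma^2 and temperature eta (normalized numerically)
    gibbs = np.exp(-2.0 * f(centers) / (eta * sigma**2))
    gibbs /= gibbs.sum() * bin_w

    # Compare the mass assigned to the deeper well (x < 0)
    print("P(x < 0): empirical %.3f  vs  Gibbs %.3f"
          % ((samples < 0).mean(), gibbs[centers < 0].sum() * bin_w))

With these settings, most of the mass should end up in the deeper well around x ≈ -1 and the two probabilities should roughly agree, in line with claim (b) of the abstract; shrinking eta sharpens this concentration but also slows the transitions between wells.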

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-azizian24a,
  title     = {What is the Long-Run Distribution of Stochastic Gradient Descent? {A} Large Deviations Analysis},
  author    = {Azizian, Wa\"{\i}ss and Iutzeler, Franck and Malick, Jerome and Mertikopoulos, Panayotis},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {2168--2229},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/azizian24a/azizian24a.pdf},
  url       = {https://proceedings.mlr.press/v235/azizian24a.html},
  abstract  = {In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally, (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.}
}
Endnote
%0 Conference Paper
%T What is the Long-Run Distribution of Stochastic Gradient Descent? A Large Deviations Analysis
%A Waïss Azizian
%A Franck Iutzeler
%A Jerome Malick
%A Panayotis Mertikopoulos
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-azizian24a
%I PMLR
%P 2168--2229
%U https://proceedings.mlr.press/v235/azizian24a.html
%V 235
%X In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem’s state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method’s step-size and energy levels determined by the problem’s objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem’s critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem’s minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally, (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.
APA
Azizian, W., Iutzeler, F., Malick, J. & Mertikopoulos, P. (2024). What is the Long-Run Distribution of Stochastic Gradient Descent? A Large Deviations Analysis. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:2168-2229. Available from https://proceedings.mlr.press/v235/azizian24a.html.