Power-Law Escape Rate of SGD
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:15959-15975, 2022.
Abstract
Stochastic gradient descent (SGD) is subject to complicated multiplicative noise for the mean-square loss. We exploit this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler, additive noise by performing a random time change. Using this formalism, we show that the log loss barrier $\Delta \log L = \log[L(\theta^s)/L(\theta^*)]$ between a local minimum $\theta^*$ and a saddle $\theta^s$ determines the escape rate of SGD from the local minimum, in contrast to previous results borrowed from physics, in which the linear loss barrier $\Delta L = L(\theta^s) - L(\theta^*)$ decides the escape rate. Our escape-rate formula depends strongly on the typical magnitude $h^*$ and the number $n$ of the outlier eigenvalues of the Hessian. This result explains the empirical fact that SGD prefers flat minima with low effective dimensions, giving insight into the implicit biases of SGD.
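As a rough one-dimensional illustration of the mechanism (not the paper's actual derivation), the sketch below assumes the noise variance is proportional to $L(\theta)$, a caricature of the multiplicative SGD noise for the mean-square loss. Under this assumption, the random time change $d\tau = L(\theta)\,dt$ turns the SDE into additive-noise Langevin dynamics on $\log L$, so an Arrhenius-type argument predicts an escape time growing like $\exp(\Delta \log L / T)$. The toy loss, temperature T, and step size dt are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D loss: local minimum at theta = 0, saddle (barrier top)
# near theta = 1, second minimum at theta = 2. Illustration only.
def L(theta):
    return 0.2 + 0.5 * theta**2 * (theta - 2.0)**2

def dL(theta):
    return theta * (theta - 2.0)**2 + theta**2 * (theta - 2.0)

def escape_time(T=0.25, dt=5e-3, max_steps=500_000):
    """First-passage time past the saddle for the time-changed SDE
        d theta = -(d/d theta) log L(theta) d tau + sqrt(2 T) dW_tau,
    i.e. additive-noise Langevin dynamics on the log-loss landscape."""
    theta = 0.0
    for step in range(max_steps):
        drift = -dL(theta) / L(theta)           # gradient of log L
        theta += drift * dt + np.sqrt(2.0 * T * dt) * rng.standard_normal()
        if theta >= 1.0:                        # crossed the saddle
            return step * dt                    # elapsed time in tau units
    return max_steps * dt

dlogL = np.log(L(1.0) / L(0.0))                 # log loss barrier of the abstract
times = [escape_time() for _ in range(10)]
print(f"Delta log L      = {dlogL:.3f}")
print(f"mean escape time = {np.mean(times):.1f}")
# Arrhenius-type estimate; agrees up to an O(1) curvature-dependent prefactor.
print(f"exp(dlogL / T)   = {np.exp(dlogL / 0.25):.1f}")
```

Note that $\exp(-\Delta \log L / T) = (L(\theta^*)/L(\theta^s))^{1/T}$, so an escape rate controlled by the log loss barrier is a power law in the loss ratio, which is where the title's "power-law" scaling comes from in this simplified picture.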