Simplicity Bias via Global Convergence of Sharpness Minimization

Khashayar Gatmiry; Zhiyuan Li; Sashank J. Reddi; Stefanie Jegelka

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:15102-15129, 2024.

Abstract

The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are 'simple', the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high-dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons, thereby implying a simple rank-one feature matrix. To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks. Along the way, we discover a novel property, a local geodesic convexity, of the trace of the Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold. This tool may be of independent interest.
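
For readers unfamiliar with the algorithm named in the abstract, the following is a minimal sketch of a single label noise SGD step, not the authors' implementation. It assumes a two-layer network of the form f(x) = sum_j tanh(theta_j . x) with fixed unit output weights, a squared loss, and symmetric label noise of magnitude sigma; the function name, the tanh activation, and all parameter names are hypothetical choices made for illustration.

```python
import numpy as np

def label_noise_sgd_step(theta, X, y, lr, sigma, rng):
    """One label noise SGD step on squared loss (illustrative sketch).

    theta: (m, d) array of first-layer weights, one row per neuron.
    X, y:  training inputs (n, d) and labels (n,).
    Assumed model: f(x) = sum_j tanh(theta_j . x).
    """
    i = rng.integers(len(X))                          # sample one training example
    noisy_y = y[i] + sigma * rng.choice([-1.0, 1.0])  # fresh label perturbation each step
    pre = theta @ X[i]                                # pre-activations, shape (m,)
    residual = np.tanh(pre).sum() - noisy_y
    # Gradient of 0.5 * residual**2 w.r.t. theta_j is residual * tanh'(pre_j) * x_i.
    grad = residual * (1.0 - np.tanh(pre) ** 2)[:, None] * X[i][None, :]
    return theta - lr * grad
```

Setting `sigma=0.0` reduces the step to plain SGD on the clean labels; the injected label noise is the only difference between the two updates.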

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-gatmiry24a,
  title = 	 {Simplicity Bias via Global Convergence of Sharpness Minimization},
  author =       {Gatmiry, Khashayar and Li, Zhiyuan and Reddi, Sashank J. and Jegelka, Stefanie},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {15102--15129},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/gatmiry24a/gatmiry24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/gatmiry24a.html},
  abstract = 	 {The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are ’simple’, the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons; thereby implying a simple rank one feature matrix. To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks. Along the way, we discover a novel property — a local geodesic convexity — of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold. This tool may be of independent interest.}
}

Endnote

%0 Conference Paper
%T Simplicity Bias via Global Convergence of Sharpness Minimization
%A Khashayar Gatmiry
%A Zhiyuan Li
%A Sashank J. Reddi
%A Stefanie Jegelka
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-gatmiry24a
%I PMLR
%P 15102--15129
%U https://proceedings.mlr.press/v235/gatmiry24a.html
%V 235
%X The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are ’simple’, the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons; thereby implying a simple rank one feature matrix. To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks. Along the way, we discover a novel property — a local geodesic convexity — of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold. This tool may be of independent interest.

APA


Gatmiry, K., Li, Z., Reddi, S. J. & Jegelka, S. (2024). Simplicity Bias via Global Convergence of Sharpness Minimization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:15102-15129. Available from https://proceedings.mlr.press/v235/gatmiry24a.html.
