Shape Matters: Understanding the Implicit Bias of the Noise Covariance

Jeff Z. HaoChen; Colin Wei; Jason Lee; Tengyu Ma

Proceedings of Thirty Fourth Conference on Learning Theory, PMLR 134:2315-2357, 2021.

Abstract

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise — induced by mini-batches or label perturbation — is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.

Cite this Paper

BibTeX


@InProceedings{pmlr-v134-haochen21a,
  title = 	 {Shape Matters: Understanding the Implicit Bias of the Noise Covariance},
  author =       {HaoChen, Jeff Z. and Wei, Colin and Lee, Jason and Ma, Tengyu},
  booktitle = 	 {Proceedings of Thirty Fourth Conference on Learning Theory},
  pages = 	 {2315--2357},
  year = 	 {2021},
  editor = 	 {Belkin, Mikhail and Kpotufe, Samory},
  volume = 	 {134},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {15--19 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v134/haochen21a/haochen21a.pdf},
  url = 	 {https://proceedings.mlr.press/v134/haochen21a.html},
  abstract = 	 {The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise — induced by mini-batches or label perturbation — is far more effective than Gaussian noise.  This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al.  We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.}
}

Endnote

%0 Conference Paper
%T Shape Matters: Understanding the Implicit Bias of the Noise Covariance
%A Jeff Z. HaoChen
%A Colin Wei
%A Jason Lee
%A Tengyu Ma
%B Proceedings of Thirty Fourth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2021
%E Mikhail Belkin
%E Samory Kpotufe
%F pmlr-v134-haochen21a
%I PMLR
%P 2315--2357
%U https://proceedings.mlr.press/v134/haochen21a.html
%V 134
%X The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise — induced by mini-batches or label perturbation — is far more effective than Gaussian noise.  This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al.  We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.

APA


HaoChen, J.Z., Wei, C., Lee, J. & Ma, T. (2021). Shape Matters: Understanding the Implicit Bias of the Noise Covariance. Proceedings of Thirty Fourth Conference on Learning Theory, in Proceedings of Machine Learning Research 134:2315-2357. Available from https://proceedings.mlr.press/v134/haochen21a.html.
