Neural networks trained with SGD learn distributions of increasing complexity

Maria Refinetti, Alessandro Ingrosso, Sebastian Goldt
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:28843-28863, 2023.

Abstract

The uncanny ability of over-parameterised neural networks to generalise well has been explained using various "simplicity biases". These theories postulate that neural networks avoid overfitting by first fitting simple, linear classifiers before learning more complex, non-linear functions. Meanwhile, data structure is also recognised as a key ingredient for good generalisation, yet its role in simplicity biases is not yet understood. Here, we show that neural networks trained using stochastic gradient descent initially classify their inputs using lower-order input statistics, like mean and covariance, and exploit higher-order statistics only later during training. We first demonstrate this distributional simplicity bias (DSB) in a solvable model of a single neuron trained on synthetic data. We then demonstrate DSB empirically in a range of deep convolutional networks and visual transformers trained on CIFAR10, and show that it even holds in networks pre-trained on ImageNet. We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of Gaussian universality in learning.
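The central comparison the abstract describes is between real data and data whose mean and covariance match the real inputs but whose higher-order statistics are Gaussian. The NumPy sketch below shows one way such "Gaussian clone" data could be constructed class by class; it is an illustration of the idea only, and the function names gaussian_clone and per_class_clones are ours, not taken from the paper's code.

import numpy as np

def gaussian_clone(X, n_samples=None, rng=None):
    # Sample from a Gaussian with the same mean and covariance as X,
    # i.e. keep the first two moments and discard all higher-order statistics.
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0] if n_samples is None else n_samples
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n)

def per_class_clones(X, y, rng=None):
    # Build a cloned dataset class by class, so each class keeps its own
    # mean and covariance but loses its higher-order structure.
    rng = np.random.default_rng() if rng is None else rng
    X_clone, y_clone = [], []
    for label in np.unique(y):
        Xk = X[y == label]
        X_clone.append(gaussian_clone(Xk, rng=rng))
        y_clone.append(np.full(Xk.shape[0], label))
    return np.concatenate(X_clone), np.concatenate(y_clone)

Under the distributional simplicity bias, a network early in training should behave almost identically on the real test set and on such a clone, with the gap opening only later as higher-order statistics are exploited.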

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-refinetti23a,
  title     = {Neural networks trained with {SGD} learn distributions of increasing complexity},
  author    = {Refinetti, Maria and Ingrosso, Alessandro and Goldt, Sebastian},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {28843--28863},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/refinetti23a/refinetti23a.pdf},
  url       = {https://proceedings.mlr.press/v202/refinetti23a.html}
}
Endnote
%0 Conference Paper
%T Neural networks trained with SGD learn distributions of increasing complexity
%A Maria Refinetti
%A Alessandro Ingrosso
%A Sebastian Goldt
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-refinetti23a
%I PMLR
%P 28843--28863
%U https://proceedings.mlr.press/v202/refinetti23a.html
%V 202
APA
Refinetti, M., Ingrosso, A., & Goldt, S. (2023). Neural networks trained with SGD learn distributions of increasing complexity. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:28843-28863. Available from https://proceedings.mlr.press/v202/refinetti23a.html.
