Deep Learning Over-Parameterization: the Shallow Fallacy

Pierre Baldi
Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), PMLR 233:7-12, 2024.

Abstract

A major tenet of conventional wisdom dictates that models should not be over-parameterized: the number of free parameters should not exceed the number of training data points. This tenet originates from centuries of shallow learning, primarily in the form of linear or logistic regression. It is routinely applied to all kinds of data analyses and modeling and even to infer properties of the brain. However, through a variety of precise mathematical examples, we show that this conventional wisdom is completely wrong as soon as one moves from shallow to deep learning. In particular, we construct sequences of both linear and non-linear deep learning models whose number of parameters can grow to infinity, while the training set can remain very small (e.g. a single example). In deep models, the parameter space is partitioned into large equivalence classes. Learning can be viewed as a communication process where information is communicated from the data to the synaptic weights. The information in the training data only needs to specify an equivalence class of the parameters, and not the exact parameter values. As such, the number of training examples can be significantly smaller than the number of free parameters.
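
As a rough illustration of the abstract's claim (a minimal sketch, not a construction from the paper itself), the snippet below trains a deep linear model y = W_L ... W_2 W_1 x on a single example: the parameter count L·d² greatly exceeds the one training point, yet plain gradient descent fits it, and rescaling two consecutive layers moves through an equivalence class of weights that all realize the same input-output map. The width d, depth L, step size, and initialization are arbitrary illustrative choices.

```python
# Hypothetical toy example (not the paper's construction): an over-parameterized
# deep linear network fitted to a single training pair, plus a demonstration that
# whole equivalence classes of weights implement the same function.
import numpy as np

rng = np.random.default_rng(0)
d, L, steps, lr = 10, 5, 2000, 0.01           # width, depth, training steps, step size
x = rng.standard_normal(d)                    # the single training input
y = rng.standard_normal(d)                    # the single training target
Ws = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(L)]

def forward(Ws, x):
    h = x
    for W in Ws:
        h = W @ h
    return h

for _ in range(steps):                        # gradient descent on 1/2 ||f(x) - y||^2
    acts = [x]
    for W in Ws:
        acts.append(W @ acts[-1])
    delta = acts[-1] - y                      # gradient w.r.t. the network output
    for i in reversed(range(L)):
        grad = np.outer(delta, acts[i])       # dE/dW_i = delta * h_{i-1}^T
        delta = Ws[i].T @ delta               # backpropagate through layer i
        Ws[i] -= lr * grad

print("free parameters:", L * d * d, "| training points:", 1)
print("fit error:", np.linalg.norm(forward(Ws, x) - y))

# Equivalence class: rescaling consecutive layers leaves the function unchanged.
c = 3.0
Ws2 = [W.copy() for W in Ws]
Ws2[0] *= c
Ws2[1] /= c
print("same map after rescaling:", np.allclose(forward(Ws, x), forward(Ws2, x)))
```

The rescaling at the end is one simple instance of the equivalence classes mentioned in the abstract: the data can only pin down the class (here, the product of the layer matrices), not the individual weight values within it.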

Cite this Paper


BibTeX
@InProceedings{pmlr-v233-baldi24a,
  title     = {Deep Learning Over-Parameterization: the Shallow Fallacy},
  author    = {Baldi, Pierre},
  booktitle = {Proceedings of the 5th Northern Lights Deep Learning Conference ({NLDL})},
  pages     = {7--12},
  year      = {2024},
  editor    = {Lutchyn, Tetiana and Ramírez Rivera, Adín and Ricaud, Benjamin},
  volume    = {233},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--11 Jan},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v233/baldi24a/baldi24a.pdf},
  url       = {https://proceedings.mlr.press/v233/baldi24a.html},
  abstract  = {A major tenet of conventional wisdom dictates that models should not be over-parameterized: the number of free parameters should not exceed the number of training data points. This tenet originates from centuries of shallow learning, primarily in the form of linear or logistic regression. It is routinely applied to all kinds of data analyses and modeling and even to infer properties of the brain. However, through a variety of precise mathematical examples, we show that this conventional wisdom is completely wrong as soon as one moves from shallow to deep learning. In particular, we construct sequences of both linear and non-linear deep learning models whose number of parameters can grow to infinity, while the training set can remain very small (e.g. a single example). In deep models, the parameter space is partitioned into large equivalence classes. Learning can be viewed as a communication process where information is communicated from the data to the synaptic weights. The information in the training data only needs to specify an equivalence class of the parameters, and not the exact parameter values. As such, the number of training examples can be significantly smaller than the number of free parameters.}
}
Endnote
%0 Conference Paper
%T Deep Learning Over-Parameterization: the Shallow Fallacy
%A Pierre Baldi
%B Proceedings of the 5th Northern Lights Deep Learning Conference ({NLDL})
%C Proceedings of Machine Learning Research
%D 2024
%E Tetiana Lutchyn
%E Adín Ramírez Rivera
%E Benjamin Ricaud
%F pmlr-v233-baldi24a
%I PMLR
%P 7--12
%U https://proceedings.mlr.press/v233/baldi24a.html
%V 233
%X A major tenet of conventional wisdom dictates that models should not be over-parameterized: the number of free parameters should not exceed the number of training data points. This tenet originates from centuries of shallow learning, primarily in the form of linear or logistic regression. It is routinely applied to all kinds of data analyses and modeling and even to infer properties of the brain. However, through a variety of precise mathematical examples, we show that this conventional wisdom is completely wrong as soon as one moves from shallow to deep learning. In particular, we construct sequences of both linear and non-linear deep learning models whose number of parameters can grow to infinity, while the training set can remain very small (e.g. a single example). In deep models, the parameter space is partitioned into large equivalence classes. Learning can be viewed as a communication process where information is communicated from the data to the synaptic weights. The information in the training data only needs to specify an equivalence class of the parameters, and not the exact parameter values. As such, the number of training examples can be significantly smaller than the number of free parameters.
APA
Baldi, P. (2024). Deep Learning Over-Parameterization: the Shallow Fallacy. Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 233:7-12. Available from https://proceedings.mlr.press/v233/baldi24a.html.