How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:5035-5081, 2024.

Abstract

A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so that they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such a sampled NN typically generalized as well as SGD-trained NNs. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow “teacher NN” that agrees with the labels. Specifically, we show that such a ‘flat’ prior over the NN parametrization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent, enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than that of the student.
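
The sampling setup described in the abstract can be made concrete with a small rejection-sampling experiment. The following is a minimal illustrative sketch, not the paper's experimental code: a narrow teacher defines the labels, and an over-parameterized student with i.i.d. uniform weights is redrawn until it perfectly classifies the training set (i.e., interpolates the data); the surviving interpolator is then evaluated on held-out data. All dimensions, widths, the data distribution, and the sampling budget below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 10, 12, 2000

# Narrow "teacher": a single linear threshold unit that defines the labels.
w_teacher = rng.standard_normal(d)
def teacher(X):
    return np.sign(X @ w_teacher)

X_train = rng.standard_normal((n_train, d))
y_train = teacher(X_train)
X_test = rng.standard_normal((n_test, d))
y_test = teacher(X_test)

def sample_student(width=50, scale=1.0):
    # Over-parameterized one-hidden-layer ReLU student with i.i.d. uniform weights.
    W1 = rng.uniform(-scale, scale, size=(d, width))
    w2 = rng.uniform(-scale, scale, size=width)
    return W1, w2

def predict(params, X):
    W1, w2 = params
    return np.sign(np.maximum(X @ W1, 0.0) @ w2)

# Rejection sampling: draw students from the (uniform) prior over parameters
# until one perfectly classifies the training set, i.e. interpolates the data.
params = None
for attempt in range(200_000):
    candidate = sample_student()
    if np.array_equal(predict(candidate, X_train), y_train):
        params = candidate
        break

if params is not None:
    test_acc = np.mean(predict(params, X_test) == y_test)
    print(f"interpolator found after {attempt + 1} draws, test accuracy = {test_acc:.3f}")
else:
    print("no interpolator found within the sampling budget")

In line with the abstract's claim, one would expect the sampled interpolator's test accuracy to be well above chance even though the student is much wider than the teacher, since the uniform prior over parameters concentrates on functions whose effective complexity matches the narrow teacher rather than the student's full parameter count.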

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-buzaglo24a,
  title     = {How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers},
  author    = {Buzaglo, Gon and Harel, Itamar and Nacson, Mor Shpigel and Brutzkus, Alon and Srebro, Nathan and Soudry, Daniel},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {5035--5081},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/buzaglo24a/buzaglo24a.pdf},
  url       = {https://proceedings.mlr.press/v235/buzaglo24a.html},
  abstract  = {A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on that the NN perfectly classifying the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow “teacher NN” that agrees with the labels. Specifically, we show that such a ‘flat’ prior over the NN parametrization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require less relevant parameters to represent — enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student’s.}
}
Endnote
%0 Conference Paper
%T How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
%A Gon Buzaglo
%A Itamar Harel
%A Mor Shpigel Nacson
%A Alon Brutzkus
%A Nathan Srebro
%A Daniel Soudry
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-buzaglo24a
%I PMLR
%P 5035--5081
%U https://proceedings.mlr.press/v235/buzaglo24a.html
%V 235
%X A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on that the NN perfectly classifying the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow “teacher NN” that agrees with the labels. Specifically, we show that such a ‘flat’ prior over the NN parametrization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require less relevant parameters to represent — enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student’s.
APA
Buzaglo, G., Harel, I., Nacson, M.S., Brutzkus, A., Srebro, N. & Soudry, D. (2024). How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:5035-5081. Available from https://proceedings.mlr.press/v235/buzaglo24a.html.