Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem

Alon Brutzkus, Amir Globerson
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:822-830, 2019.

Abstract

Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. We demonstrate this theoretically for a 3-layer convolutional neural network with max-pooling, in a novel setting which extends the XOR problem. We show that this interplay implies that with overparameterization, gradient descent converges to global minima with better generalization performance compared to global minima of small networks. Empirically, we demonstrate these phenomena for a 3-layer convolutional neural network on the MNIST task.
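The setting the abstract describes can be made concrete with a small sketch: a "3-layer" network consisting of ReLU filters applied to 2-dimensional patches, max-pooling over patches, and fixed +/-1 output weights, trained by gradient descent on an XOR-style detection task, with the number of channels k acting as the over-parameterization knob. This is an illustrative sketch only, not the paper's exact XORD distribution, loss, or theorem conditions; the data generator, hinge loss, learning rate, step count, initialization scale, and the names sample/train are assumptions made here for illustration.

import torch

torch.manual_seed(0)

def sample(n, d):
    # Illustrative XOR-style detection data (an assumption, not the paper's exact XORD setup):
    # negative examples have every 2-dim patch equal to (+1,+1) or (-1,-1);
    # positive examples contain at least one "XOR" patch (+1,-1) or (-1,+1).
    same = torch.tensor([[1., 1.], [-1., -1.]])
    diff = torch.tensor([[1., -1.], [-1., 1.]])
    Xneg = same[torch.randint(0, 2, (n // 2, d))]
    Xpos = same[torch.randint(0, 2, (n // 2, d))]
    Xpos[torch.arange(n // 2), torch.randint(0, d, (n // 2,))] = diff[torch.randint(0, 2, (n // 2,))]
    return torch.cat([Xpos, Xneg]), torch.cat([torch.ones(n // 2), -torch.ones(n // 2)])

def train(X, y, k, steps=2000, lr=0.05):
    # 2k ReLU filters over 2-dim patches, max-pooling over patches,
    # fixed +/-1 output weights; only the filters are trained.
    W = (0.1 * torch.randn(2 * k, 2)).requires_grad_()
    v = torch.cat([torch.ones(k), -torch.ones(k)])
    for _ in range(steps):
        scores = torch.relu(X @ W.t()).max(dim=1).values @ v   # conv -> max-pool -> output
        loss = torch.clamp(1 - y * scores, min=0).mean()        # hinge loss (an assumption)
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad
            W.grad.zero_()
    return W, v

d = 8
Xtr, ytr = sample(64, d)
Xte, yte = sample(2000, d)
for k in (2, 64):   # small vs. over-parameterized channel count
    W, v = train(Xtr, ytr, k)
    pred = (torch.relu(Xte @ W.t()).max(dim=1).values @ v).sign()
    print(f"k={k:3d}  test accuracy: {(pred == yte).float().mean().item():.3f}")

Here k plays the role of over-parameterization in the abstract: whether the small network's gradient descent run lands in a poorly generalizing global minimum on a given seed, while the wide network's filters cluster around the informative patterns, is the kind of event the paper analyzes.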

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-brutzkus19b,
  title     = {Why do Larger Models Generalize Better? {A} Theoretical Perspective via the {XOR} Problem},
  author    = {Brutzkus, Alon and Globerson, Amir},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {822--830},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/brutzkus19b/brutzkus19b.pdf},
  url       = {https://proceedings.mlr.press/v97/brutzkus19b.html},
  abstract  = {Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. We demonstrate this theoretically for a 3-layer convolutional neural network with max-pooling, in a novel setting which extends the XOR problem. We show that this interplay implies that with overparameterization, gradient descent converges to global minima with better generalization performance compared to global minima of small networks. Empirically, we demonstrate these phenomena for a 3-layer convolutional neural network on the MNIST task.}
}
Endnote
%0 Conference Paper
%T Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem
%A Alon Brutzkus
%A Amir Globerson
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-brutzkus19b
%I PMLR
%P 822--830
%U https://proceedings.mlr.press/v97/brutzkus19b.html
%V 97
%X Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. We demonstrate this theoretically for a 3-layer convolutional neural network with max-pooling, in a novel setting which extends the XOR problem. We show that this interplay implies that with overparameterization, gradient descent converges to global minima with better generalization performance compared to global minima of small networks. Empirically, we demonstrate these phenomena for a 3-layer convolutional neural network on the MNIST task.
APA
Brutzkus, A. & Globerson, A.. (2019). Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:822-830 Available from https://proceedings.mlr.press/v97/brutzkus19b.html.
