Benign overfitting of non-smooth neural networks beyond lazy training

Xingyu Xu, Yuantao Gu
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:11094-11117, 2023.

Abstract

Benign overfitting refers to a recently discovered and intriguing phenomenon in which over-parameterized neural networks can, in many cases, fit the training data perfectly yet still generalize well, surprisingly contrary to the traditional belief that overfitting is harmful for generalization. Despite its surging popularity in recent years, little is known about the theoretical aspects of benign overfitting in neural networks. In this work, we provide a theoretical analysis of benign overfitting for two-layer neural networks with possibly non-smooth activation functions. Without resorting to the popular Neural Tangent Kernel (NTK) approximation, we prove that neural networks can be trained with gradient descent to classify binary-labeled training data perfectly (achieving zero training loss), even in the presence of polluted labels, while still generalizing well. Our result removes the smoothness assumption in previous literature and goes beyond the NTK regime; this enables a better theoretical understanding of benign overfitting within a practically more meaningful setting, e.g., with (leaky-)ReLU activation function, small random initialization, and finite network width.

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-xu23k,
  title     = {Benign overfitting of non-smooth neural networks beyond lazy training},
  author    = {Xu, Xingyu and Gu, Yuantao},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {11094--11117},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/xu23k/xu23k.pdf},
  url       = {https://proceedings.mlr.press/v206/xu23k.html},
  abstract  = {Benign overfitting refers to a recently discovered intriguing phenomenon that over-parameterized neural networks, in many cases, can fit the training data perfectly but still generalize well, surprisingly contrary to the traditional belief that overfitting is harmful for generalization. In spite of its surging popularity in recent years, little has been known in the theoretical aspect of benign overfitting of neural networks. In this work, we provide a theoretical analysis of benign overfitting for two-layer neural networks with possibly non-smooth activation function. Without resorting to the popular Neural Tangent Kernel (NTK) approximation, we prove that neural networks can be trained with gradient descent to classify binary-labeled training data perfectly (achieving zero training loss) even in presence of polluted labels, but still generalize well. Our result removes the smoothness assumption in previous literature and goes beyond the NTK regime; this enables a better theoretical understanding of benign overfitting within a practically more meaningful setting, e.g., with (leaky-)ReLU activation function, small random initialization, and finite network width.}
}
Endnote
%0 Conference Paper
%T Benign overfitting of non-smooth neural networks beyond lazy training
%A Xingyu Xu
%A Yuantao Gu
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-xu23k
%I PMLR
%P 11094--11117
%U https://proceedings.mlr.press/v206/xu23k.html
%V 206
%X Benign overfitting refers to a recently discovered intriguing phenomenon that over-parameterized neural networks, in many cases, can fit the training data perfectly but still generalize well, surprisingly contrary to the traditional belief that overfitting is harmful for generalization. In spite of its surging popularity in recent years, little has been known in the theoretical aspect of benign overfitting of neural networks. In this work, we provide a theoretical analysis of benign overfitting for two-layer neural networks with possibly non-smooth activation function. Without resorting to the popular Neural Tangent Kernel (NTK) approximation, we prove that neural networks can be trained with gradient descent to classify binary-labeled training data perfectly (achieving zero training loss) even in presence of polluted labels, but still generalize well. Our result removes the smoothness assumption in previous literature and goes beyond the NTK regime; this enables a better theoretical understanding of benign overfitting within a practically more meaningful setting, e.g., with (leaky-)ReLU activation function, small random initialization, and finite network width.
APA
Xu, X. & Gu, Y. (2023). Benign overfitting of non-smooth neural networks beyond lazy training. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:11094-11117. Available from https://proceedings.mlr.press/v206/xu23k.html.