Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping

Ilja Kuzborskij, Csaba Szepesvari
Proceedings of Thirty Fourth Conference on Learning Theory, PMLR 134:2853-2890, 2021.

Abstract

We explore the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noisy labels, neural networks trained to nearly zero training error are inconsistent on this class, we propose an early stopping rule that allows us to show optimal rates. This provides an alternative to the result of Hu et al. (2021) who studied the performance of $\ell_2$-regularized GD for training shallow networks in nonparametric regression which fully relied on the infinite-width network (Neural Tangent Kernel (NTK)) approximation. Here we present a simpler analysis which is based on a partitioning argument of the input space (as in the case of 1-nearest-neighbor rule) coupled with the fact that trained neural networks are smooth with respect to their inputs when trained by GD. In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result. In the case of label noise, by slightly modifying the proof, the noise is controlled using a technique of Yao, Rosasco, and Caponnetto (2007).
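The setting in the abstract (a wide shallow network fit by plain GD, with training halted before it interpolates noisy labels) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the ReLU activation, the fixed random output signs with 1/sqrt(width) scaling, the appended constant feature, the synthetic target |x|, and in particular the hold-out "patience" stopping rule, which is a generic stand-in rather than the stopping rule the authors propose.

```python
import numpy as np


def predict(W, a, X):
    """Shallow ReLU network: f(x) = (1/sqrt(m)) * sum_k a_k * relu(<w_k, x>)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])


def train_gd_early_stopping(X, y, X_val, y_val, width=256, lr=0.1,
                            max_steps=500, patience=50, seed=0):
    """Full-batch GD on the squared loss, keeping the iterate with the best
    hold-out error. The patience rule is a hypothetical stand-in, not the
    paper's stopping rule."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(width, d))            # hidden-layer weights (trained)
    a = rng.choice([-1.0, 1.0], size=width)    # output signs (kept fixed, an assumption)
    scale = 1.0 / np.sqrt(width)

    best_val, best_W, best_step = np.inf, W.copy(), 0
    for step in range(1, max_steps + 1):
        pre = X @ W.T                                    # (n, width) pre-activations
        resid = scale * np.maximum(pre, 0.0) @ a - y     # (n,) residuals
        # Gradient of (1/2n) * sum_i resid_i^2 with respect to W.
        grad = scale * ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / n
        W = W - lr * grad

        val = np.mean((predict(W, a, X_val) - y_val) ** 2)
        if val < best_val:
            best_val, best_W, best_step = val, W.copy(), step
        elif step - best_step >= patience:
            break                                        # no recent improvement: stop early
    return best_W, a, best_step, best_val


# Synthetic example: a 1-Lipschitz target with additive label noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(300, 1))
X_all = np.hstack([x, np.ones_like(x)])                  # constant feature acts as a bias
y_all = np.abs(x[:, 0]) + 0.1 * rng.normal(size=300)
X_tr, y_tr, X_val, y_val = X_all[:200], y_all[:200], X_all[200:], y_all[200:]

W_best, a, stop_step, val_mse = train_gd_early_stopping(X_tr, y_tr, X_val, y_val)
print(f"stopped at step {stop_step}, hold-out MSE {val_mse:.4f}")
```

The sketch returns the weights from the best hold-out step rather than the final iterate, which is one common way to implement early stopping in practice; the paper's analysis concerns a stopping rule of its own, so this should be read only as an illustration of the general idea.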

Cite this Paper


BibTeX
@InProceedings{pmlr-v134-kuzborskij21a,
  title     = {Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping},
  author    = {Kuzborskij, Ilja and Szepesvari, Csaba},
  booktitle = {Proceedings of Thirty Fourth Conference on Learning Theory},
  pages     = {2853--2890},
  year      = {2021},
  editor    = {Belkin, Mikhail and Kpotufe, Samory},
  volume    = {134},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--19 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v134/kuzborskij21a/kuzborskij21a.pdf},
  url       = {https://proceedings.mlr.press/v134/kuzborskij21a.html},
  abstract  = {We explore the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noisy labels, neural networks trained to nearly zero training error are inconsistent on this class, we propose an early stopping rule that allows us to show optimal rates. This provides an alternative to the result of Hu et al. (2021) who studied the performance of $\ell_2$-regularized GD for training shallow networks in nonparametric regression which fully relied on the infinite-width network (Neural Tangent Kernel (NTK)) approximation. Here we present a simpler analysis which is based on a partitioning argument of the input space (as in the case of 1-nearest-neighbor rule) coupled with the fact that trained neural networks are smooth with respect to their inputs when trained by GD. In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result. In the case of label noise, by slightly modifying the proof, the noise is controlled using a technique of Yao, Rosasco, and Caponnetto (2007).}
}
Endnote
%0 Conference Paper
%T Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping
%A Ilja Kuzborskij
%A Csaba Szepesvari
%B Proceedings of Thirty Fourth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2021
%E Mikhail Belkin
%E Samory Kpotufe
%F pmlr-v134-kuzborskij21a
%I PMLR
%P 2853--2890
%U https://proceedings.mlr.press/v134/kuzborskij21a.html
%V 134
%X We explore the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noisy labels, neural networks trained to nearly zero training error are inconsistent on this class, we propose an early stopping rule that allows us to show optimal rates. This provides an alternative to the result of Hu et al. (2021) who studied the performance of $\ell_2$-regularized GD for training shallow networks in nonparametric regression which fully relied on the infinite-width network (Neural Tangent Kernel (NTK)) approximation. Here we present a simpler analysis which is based on a partitioning argument of the input space (as in the case of 1-nearest-neighbor rule) coupled with the fact that trained neural networks are smooth with respect to their inputs when trained by GD. In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result. In the case of label noise, by slightly modifying the proof, the noise is controlled using a technique of Yao, Rosasco, and Caponnetto (2007).
APA
Kuzborskij, I. & Szepesvari, C. (2021). Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping. Proceedings of Thirty Fourth Conference on Learning Theory, in Proceedings of Machine Learning Research 134:2853-2890. Available from https://proceedings.mlr.press/v134/kuzborskij21a.html.
