Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Tianyang Hu, Wenjia Wang, Cong Lin, Guang Cheng
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:829-837, 2021.

Abstract

Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well overparametrized neural networks can recover the true target function in the presence of random noise. We establish a lower bound on the L2 estimation error as a function of the GD iteration, which stays bounded away from zero unless early stopping is chosen delicately. In turn, through a comprehensive analysis of L2-regularized GD trajectories, we prove that for an overparametrized one-hidden-layer ReLU neural network with L2 regularization: (1) the output is close to that of kernel ridge regression with the corresponding neural tangent kernel; (2) the minimax optimal rate of the L2 estimation error is achieved. Numerical experiments confirm our theory and further demonstrate that the L2 regularization approach improves training robustness and works for a wider range of neural networks.
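To make the kernel connection concrete, the sketch below runs kernel ridge regression with a commonly used neural tangent kernel for a one-hidden-layer ReLU network, K(x, x') = x·x' (π − arccos(u)) / (2π) with u the cosine similarity of x and x' (the form that arises when only the hidden-layer weights are trained and inputs are normalized). This is an illustrative baseline under those assumptions, not the paper's implementation; the exact kernel, scaling, and regularization level used in the paper may differ.

```python
# Minimal sketch: kernel ridge regression with an NTK for a one-hidden-layer ReLU
# network. Assumes only hidden-layer weights are trained and inputs lie on the unit
# sphere; the paper's exact kernel and constants may differ.
import numpy as np

def relu_ntk(X1, X2):
    """Gram matrix K(x, x') = <x, x'> * (pi - arccos(u)) / (2*pi),
    where u is the cosine of the angle between x and x'."""
    norms1 = np.linalg.norm(X1, axis=1, keepdims=True)
    norms2 = np.linalg.norm(X2, axis=1, keepdims=True)
    inner = X1 @ X2.T
    u = np.clip(inner / (norms1 * norms2.T), -1.0, 1.0)
    return inner * (np.pi - np.arccos(u)) / (2 * np.pi)

def kernel_ridge_fit(X, y, lam):
    """Solve (K + n*lam*I) alpha = y for the KRR coefficients."""
    n = X.shape[0]
    K = relu_ntk(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(X_train, alpha, X_test):
    return relu_ntk(X_test, X_train) @ alpha

# Toy usage: noisy observations of a smooth target on the unit sphere in R^3.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # normalize inputs
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)   # true function + noise

alpha = kernel_ridge_fit(X, y, lam=1e-3)
y_hat = kernel_ridge_predict(X, alpha, X)
print("training MSE:", np.mean((y_hat - y) ** 2))
```

Here lam plays the role that the network's L2 weight penalty is shown to correspond to: the paper's claim is that the output of the L2-regularized, GD-trained network stays close to this kind of KRR predictor, which (with a suitably chosen regularization level) attains the minimax optimal L2 estimation rate.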

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-hu21a,
  title     = {Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network},
  author    = {Hu, Tianyang and Wang, Wenjia and Lin, Cong and Cheng, Guang},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {829--837},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/hu21a/hu21a.pdf},
  url       = {https://proceedings.mlr.press/v130/hu21a.html},
  abstract  = {Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well overparametrized neural networks can recover the true target function in the presence of random noises. We establish a lower bound on the L2 estimation error with respect to the GD iteration, which is away from zero without a delicate choice of early stopping. In turn, through a comprehensive analysis of L2-regularized GD trajectories, we prove that for overparametrized one-hidden-layer ReLU neural network with the L2 regularization: (1) the output is close to that of the kernel ridge regression with the corresponding neural tangent kernel; (2) minimax optimal rate of the L2 estimation error is achieved. Numerical experiments confirm our theory and further demonstrate that the L2 regularization approach improves the training robustness and works for a wider range of neural networks.}
}
Endnote
%0 Conference Paper
%T Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network
%A Tianyang Hu
%A Wenjia Wang
%A Cong Lin
%A Guang Cheng
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-hu21a
%I PMLR
%P 829--837
%U https://proceedings.mlr.press/v130/hu21a.html
%V 130
%X Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well overparametrized neural networks can recover the true target function in the presence of random noises. We establish a lower bound on the L2 estimation error with respect to the GD iteration, which is away from zero without a delicate choice of early stopping. In turn, through a comprehensive analysis of L2-regularized GD trajectories, we prove that for overparametrized one-hidden-layer ReLU neural network with the L2 regularization: (1) the output is close to that of the kernel ridge regression with the corresponding neural tangent kernel; (2) minimax optimal rate of the L2 estimation error is achieved. Numerical experiments confirm our theory and further demonstrate that the L2 regularization approach improves the training robustness and works for a wider range of neural networks.
APA
Hu, T., Wang, W., Lin, C. & Cheng, G. (2021). Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:829-837. Available from https://proceedings.mlr.press/v130/hu21a.html.