Optimization Theory for ReLU Neural Networks Trained with Normalization Layers

Yonatan Dukler, Quanquan Gu, Guido Montufar
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:2751-2760, 2020.

Abstract

The current paradigm of deep neural networks has been successful in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice as they improve the generalization performance and training speed of neural networks significantly. Nonetheless, the vast majority of current deep learning theory and non-convex optimization literature focuses on the un-normalized setting. We bridge this gap by providing the first global convergence result for 2 layer non-linear neural networks with ReLU activations trained with a normalization layer, namely Weight Normalization. The analysis shows how the introduction of normalization layers changes the optimization landscape and in some settings enables faster convergence as compared with un-normalized neural networks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-dukler20a,
  title = {Optimization Theory for {R}e{LU} Neural Networks Trained with Normalization Layers},
  author = {Dukler, Yonatan and Gu, Quanquan and Montufar, Guido},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {2751--2760},
  year = {2020},
  editor = {III, Hal Daumé and Singh, Aarti},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/dukler20a/dukler20a.pdf},
  url = {http://proceedings.mlr.press/v119/dukler20a.html},
  abstract = {The current paradigm of deep neural networks has been successful in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice as they improve the generalization performance and training speed of neural networks significantly. Nonetheless, the vast majority of current deep learning theory and non-convex optimization literature focuses on the un-normalized setting. We bridge this gap by providing the first global convergence result for 2 layer non-linear neural networks with ReLU activations trained with a normalization layer, namely Weight Normalization. The analysis shows how the introduction of normalization layers changes the optimization landscape and in some settings enables faster convergence as compared with un-normalized neural networks.}
}
Endnote
%0 Conference Paper
%T Optimization Theory for ReLU Neural Networks Trained with Normalization Layers
%A Yonatan Dukler
%A Quanquan Gu
%A Guido Montufar
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-dukler20a
%I PMLR
%P 2751--2760
%U http://proceedings.mlr.press/v119/dukler20a.html
%V 119
%X The current paradigm of deep neural networks has been successful in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice as they improve the generalization performance and training speed of neural networks significantly. Nonetheless, the vast majority of current deep learning theory and non-convex optimization literature focuses on the un-normalized setting. We bridge this gap by providing the first global convergence result for 2 layer non-linear neural networks with ReLU activations trained with a normalization layer, namely Weight Normalization. The analysis shows how the introduction of normalization layers changes the optimization landscape and in some settings enables faster convergence as compared with un-normalized neural networks.
APA
Dukler, Y., Gu, Q. & Montufar, G. (2020). Optimization Theory for ReLU Neural Networks Trained with Normalization Layers. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:2751-2760. Available from http://proceedings.mlr.press/v119/dukler20a.html.
