The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent

Karthik Abinav Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, Tom Goldstein
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:8469-8479, 2020.

Abstract

This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or networks using both batch normalization and skip connections help reduce the training burden of very deep networks.
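As a rough illustration of the idea in the abstract, the sketch below measures how strongly per-sample gradients conflict by taking the most negative pairwise inner product between them. This is only a sketch under stated assumptions: the toy least-squares model, the function names `per_sample_gradients` and `min_pairwise_inner_product`, and the choice of this particular statistic are illustrative and not taken from the page above; the paper's formal definition may differ in detail.

```python
import numpy as np


def per_sample_gradients(w, X, y):
    """Gradients of f_i(w) = 0.5 * (x_i . w - y_i)^2, one row per sample.

    Hypothetical toy model chosen for illustration only.
    """
    residuals = X @ w - y            # shape (n,)
    return residuals[:, None] * X    # shape (n, d)


def min_pairwise_inner_product(grads):
    """Smallest <g_i, g_j> over all pairs i != j.

    A strongly negative value means at least two samples pull the
    parameters in conflicting directions.
    """
    gram = grads @ grads.T
    off_diagonal = ~np.eye(len(grads), dtype=bool)
    return gram[off_diagonal].min()


w = np.zeros(2)

# Two samples with identical inputs but opposite targets: their
# gradients at w = 0 point in exactly opposite directions.
X_conflict = np.array([[1.0, 0.0], [1.0, 0.0]])
y_conflict = np.array([1.0, -1.0])
conflict = min_pairwise_inner_product(
    per_sample_gradients(w, X_conflict, y_conflict)
)

# Two samples with orthogonal inputs: their gradients do not interact.
X_orthogonal = np.eye(2)
y_orthogonal = np.array([1.0, 1.0])
orthogonal = min_pairwise_inner_product(
    per_sample_gradients(w, X_orthogonal, y_orthogonal)
)

print(conflict, orthogonal)
```

In the first case a step that helps one sample undoes progress on the other, which is the slow-convergence regime the abstract describes; in the second case the samples do not interfere with each other.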

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-sankararaman20a,
  title = {The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent},
  author = {Sankararaman, Karthik Abinav and De, Soham and Xu, Zheng and Huang, W. Ronny and Goldstein, Tom},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {8469--8479},
  year = {2020},
  editor = {III, Hal Daumé and Singh, Aarti},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/sankararaman20a/sankararaman20a.pdf},
  url = {http://proceedings.mlr.press/v119/sankararaman20a.html},
  abstract = {This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or networks using both batch normalization and skip connections help reduce the training burden of very deep networks.}
}
Endnote
%0 Conference Paper
%T The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
%A Karthik Abinav Sankararaman
%A Soham De
%A Zheng Xu
%A W. Ronny Huang
%A Tom Goldstein
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-sankararaman20a
%I PMLR
%P 8469--8479
%U http://proceedings.mlr.press/v119/sankararaman20a.html
%V 119
%X This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or networks using both batch normalization and skip connections help reduce the training burden of very deep networks.
APA
Sankararaman, K.A., De, S., Xu, Z., Huang, W.R. & Goldstein, T. (2020). The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:8469-8479. Available from http://proceedings.mlr.press/v119/sankararaman20a.html.
