The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

Daniel Park, Jascha Sohl-Dickstein, Quoc Le, Samuel Smith
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5042-5051, 2019.

Abstract

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
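To make the abstract's central quantity concrete, the sketch below computes a common approximation of the SGD noise scale, g ≈ εN/B (learning rate ε, training-set size N, batch size B), used in prior work on SGD noise. This is an illustration only: the paper's "normalized" noise scale additionally folds in the initialization/parameterization ("Standard" vs "NTK"), and the function name and constants here are hypothetical.

# Minimal sketch, assuming the common approximation g ~ lr * N / B for the
# SGD noise scale; it is NOT the paper's exact "normalized" definition,
# which also depends on the initialization conditions.

def sgd_noise_scale(learning_rate, dataset_size, batch_size):
    """Approximate SGD noise scale g ~ eps * N / B (valid when B << N)."""
    return learning_rate * dataset_size / batch_size

# Holding the noise scale fixed while growing the batch size requires a
# proportionally larger learning rate; once the largest stable learning
# rate is reached, the batch size cannot grow further -- the trade-off the
# abstract notes for wide networks.
for batch_size in (64, 256, 1024):
    lr = 0.1 * batch_size / 256          # linear scaling keeps g constant
    g = sgd_noise_scale(lr, dataset_size=50_000, batch_size=batch_size)
    print(f"B={batch_size:5d}  lr={lr:.4f}  g={g:.1f}")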

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-park19b,
  title     = {The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study},
  author    = {Park, Daniel and Sohl-Dickstein, Jascha and Le, Quoc and Smith, Samuel},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {5042--5051},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/park19b/park19b.pdf},
  url       = {https://proceedings.mlr.press/v97/park19b.html},
  abstract  = {We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.}
}
Endnote
%0 Conference Paper
%T The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
%A Daniel Park
%A Jascha Sohl-Dickstein
%A Quoc Le
%A Samuel Smith
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-park19b
%I PMLR
%P 5042--5051
%U https://proceedings.mlr.press/v97/park19b.html
%V 97
%X We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
APA
Park, D., Sohl-Dickstein, J., Le, Q. & Smith, S. (2019). The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:5042-5051. Available from https://proceedings.mlr.press/v97/park19b.html.
