On the Generalization Benefit of Noise in Stochastic Gradient Descent

Samuel Smith, Erich Elsen, Soham De
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:9058-9067, 2020.

Abstract

It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However, recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.
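
For context on the stochastic differential equation perspective mentioned in the abstract, the sketch below shows the standard continuous-time approximation of SGD used in this line of work. It is not reproduced from the paper itself, and the notation (ω, C, R, ε, B, N, T) is introduced here purely for illustration.

d\omega = -\nabla C(\omega)\, dt + \sqrt{T}\, R(\omega)\, dW, \qquad T \propto \frac{\epsilon}{B}

Here ω denotes the model parameters, C(ω) the training loss, R(ω) a factor of the gradient-noise covariance, dW a Wiener process, ε the learning rate, and B the batch size. Under this approximation the "temperature" T (often quoted as a noise scale of roughly εN/B for dataset size N when B ≪ N) sets the magnitude of the SGD noise whose generalization benefit the paper studies.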

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-smith20a,
  title     = {On the Generalization Benefit of Noise in Stochastic Gradient Descent},
  author    = {Smith, Samuel and Elsen, Erich and De, Soham},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {9058--9067},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/smith20a/smith20a.pdf},
  url       = {https://proceedings.mlr.press/v119/smith20a.html},
  abstract  = {It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However, recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.}
}
Endnote
%0 Conference Paper
%T On the Generalization Benefit of Noise in Stochastic Gradient Descent
%A Samuel Smith
%A Erich Elsen
%A Soham De
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-smith20a
%I PMLR
%P 9058--9067
%U https://proceedings.mlr.press/v119/smith20a.html
%V 119
%X It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However, recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.
APA
Smith, S., Elsen, E. & De, S. (2020). On the Generalization Benefit of Noise in Stochastic Gradient Descent. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:9058-9067. Available from https://proceedings.mlr.press/v119/smith20a.html.