A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

Yongqiang Cai, Qianxiao Li, Zuowei Shen
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:882-890, 2019.

Abstract

Despite its empirical success and recent theoretical progress, a quantitative analysis of the effect of batch normalization (BN) on the convergence and stability of gradient descent is generally lacking. In this paper, we provide such an analysis for the simple problem of ordinary least squares (OLS), where the precise dynamical properties of gradient descent (GD) are completely known, allowing us to isolate and compare the additional effects of BN. More precisely, we show that unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two different sources of acceleration of BNGD over GD: one due to over-parameterization, which improves the effective condition number, and another due to the large admissible range of learning rates, which gives rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its robustness properties. These findings are confirmed quantitatively by numerical experiments, which further show that many of the uncovered properties of BNGD in OLS are also observed qualitatively in more complex supervised learning problems.
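The abstract's central claim, that BNGD on OLS stays stable for arbitrarily large weight learning rates while plain GD does not, can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes a population-level OLS loss and a BN-style reparameterization a * w^T x / sqrt(w^T H w) with x ~ N(0, H), which is one common way to formalize the setup; the covariance, initialization, and step sizes below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
d = 10

# Random positive-definite covariance H and ground-truth weights u_star.
A = rng.standard_normal((d, d))
H = A @ A.T + 0.1 * np.eye(d)
u_star = rng.standard_normal(d)
lam_max = np.linalg.eigvalsh(H).max()

def gd_ols(eps, steps=500):
    # Plain GD on 0.5*(w - u*)^T H (w - u*); stable only if eps < 2/lambda_max(H).
    w = rng.standard_normal(d)
    for _ in range(steps):
        w = w - eps * H @ (w - u_star)
    return 0.5 * (w - u_star) @ H @ (w - u_star)

def bngd_ols(eps_w, eps_a=0.5, steps=500):
    # GD on the BN-reparameterized population loss
    # J(a, w) = 0.5 * E[(u*^T x - a * w^T x / sigma_w)^2], sigma_w = sqrt(w^T H w).
    w = rng.standard_normal(d)
    a = 0.0
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)
        grad_a = a - (u_star @ H @ w) / sigma
        grad_w = -a * (H @ u_star / sigma - (u_star @ H @ w) * (H @ w) / sigma**3)
        w = w - eps_w * grad_w
        a = a - eps_a * grad_a
    sigma = np.sqrt(w @ H @ w)
    return 0.5 * (u_star @ H @ u_star - 2 * a * (u_star @ H @ w) / sigma + a**2)

eps = 2.5 / lam_max  # just above plain GD's stability threshold 2/lambda_max
print("GD,   eps   = 2.5/lam_max:", gd_ols(eps))          # expected to blow up
print("BNGD, eps_w = 2.5/lam_max:", bngd_ols(eps))        # expected to keep decreasing
print("BNGD, eps_w = 250/lam_max:", bngd_ols(100 * eps))  # much larger eps_w, still stable

In this sketch the BN loss depends on w only through its direction (grad_w is orthogonal to w), so the effective step on the direction shrinks automatically as the norm of w grows, which is the scale-invariance mechanism behind the stability claim.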

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-cai19a,
  title     = {A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent},
  author    = {Cai, Yongqiang and Li, Qianxiao and Shen, Zuowei},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {882--890},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/cai19a/cai19a.pdf},
  url       = {https://proceedings.mlr.press/v97/cai19a.html},
  abstract  = {Despite its empirical success and recent theoretical progress, there generally lacks a quantitative analysis of the effect of batch normalization (BN) on the convergence and stability of gradient descent. In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS), where the precise dynamical properties of gradient descent (GD) is completely known, thus allowing us to isolate and compare the additional effects of BN. More precisely, we show that unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two different sources of acceleration of BNGD over GD – one due to over-parameterization which improves the effective condition number and another due having a large range of learning rates giving rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its robustness properties. These findings are confirmed quantitatively by numerical experiments, which further show that many of the uncovered properties of BNGD in OLS are also observed qualitatively in more complex supervised learning problems.}
}
Endnote
%0 Conference Paper
%T A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent
%A Yongqiang Cai
%A Qianxiao Li
%A Zuowei Shen
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-cai19a
%I PMLR
%P 882--890
%U https://proceedings.mlr.press/v97/cai19a.html
%V 97
%X Despite its empirical success and recent theoretical progress, there generally lacks a quantitative analysis of the effect of batch normalization (BN) on the convergence and stability of gradient descent. In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS), where the precise dynamical properties of gradient descent (GD) is completely known, thus allowing us to isolate and compare the additional effects of BN. More precisely, we show that unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two different sources of acceleration of BNGD over GD – one due to over-parameterization which improves the effective condition number and another due having a large range of learning rates giving rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its robustness properties. These findings are confirmed quantitatively by numerical experiments, which further show that many of the uncovered properties of BNGD in OLS are also observed qualitatively in more complex supervised learning problems.
APA
Cai, Y., Li, Q. & Shen, Z. (2019). A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:882-890. Available from https://proceedings.mlr.press/v97/cai19a.html.