On the Training Instability of Shuffling SGD with Batch Normalization

David Xing Wu, Chulhee Yun, Suvrit Sra
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:37787-37845, 2023.

Abstract

We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR)—two widely used variants of SGD—interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalized inputs, we prove that SS and RR converge to distinct global optima that are “distorted” away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can, and cannot, occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.
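
To make the SS/RR distinction concrete, the following minimal Python sketch illustrates the two sampling schemes on a toy problem in the spirit of the regression setting above: mini-batch SGD on a linear model whose inputs are batch-normalized per mini-batch, where Single Shuffle reuses one fixed data permutation every epoch and Random Reshuffle draws a fresh permutation each epoch. This is an illustrative sketch, not the paper's construction or experimental setup; the model, data, hyperparameters, and helper names (batch_norm, train) are assumptions made for the example.

import numpy as np

def batch_norm(Xb, eps=1e-5):
    # Normalize each feature using the statistics of the current mini-batch only.
    return (Xb - Xb.mean(axis=0)) / (Xb.std(axis=0) + eps)

def train(shuffle_mode, X, y, lr=0.1, batch_size=8, epochs=50, seed=0):
    # shuffle_mode == "SS": reuse one fixed permutation every epoch (Single Shuffle).
    # shuffle_mode == "RR": draw a fresh permutation each epoch (Random Reshuffle).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    fixed_perm = rng.permutation(n)
    for _ in range(epochs):
        perm = fixed_perm if shuffle_mode == "SS" else rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb = batch_norm(X[idx])   # batch-norm statistics depend on which points share a batch
            grad = Xb.T @ (Xb @ w - y[idx]) / len(idx)   # gradient of the mini-batch least-squares loss
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)
w_ss = train("SS", X, y)
w_rr = train("RR", X, y)
print("distance between SS and RR solutions:", np.linalg.norm(w_ss - w_rr))

Under SS the same mini-batches, and hence the same batch-normalization statistics, recur every epoch; under RR the batch compositions change from epoch to epoch, which aligns with the abstract's observation that RR avoids the distortion and divergence that SS can exhibit.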

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-wu23x,
  title     = {On the Training Instability of Shuffling {SGD} with Batch Normalization},
  author    = {Wu, David Xing and Yun, Chulhee and Sra, Suvrit},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {37787--37845},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/wu23x/wu23x.pdf},
  url       = {https://proceedings.mlr.press/v202/wu23x.html},
  abstract  = {We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR)—two widely used variants of SGD—interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalized inputs, we prove that SS and RR converge to distinct global optima that are “distorted” away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can, and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.}
}
Endnote
%0 Conference Paper
%T On the Training Instability of Shuffling SGD with Batch Normalization
%A David Xing Wu
%A Chulhee Yun
%A Suvrit Sra
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-wu23x
%I PMLR
%P 37787--37845
%U https://proceedings.mlr.press/v202/wu23x.html
%V 202
%X We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR)—two widely used variants of SGD—interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalized inputs, we prove that SS and RR converge to distinct global optima that are “distorted” away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can, and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.
APA
Wu, D.X., Yun, C. & Sra, S. (2023). On the Training Instability of Shuffling SGD with Batch Normalization. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:37787-37845. Available from https://proceedings.mlr.press/v202/wu23x.html.
