Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:26330-26346, 2022.

Abstract

This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network’s weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
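The core phenomenon described above, training loss stabilizing even though the gradient norm stays bounded away from zero because the distribution of the iterates (rather than the iterates themselves) converges, can be reproduced in a toy setting. The following Python sketch is not from the paper; it assumes constant-step-size SGD on a simple noisy quadratic, with illustrative choices of dimension, step size, and noise scale, and merely demonstrates the qualitative behavior the abstract reports for large-scale models.

# Toy illustration (not from the paper): constant-step-size SGD on a noisy
# quadratic. The loss plateaus and the iterates settle into an approximately
# invariant (stationary) distribution, yet the gradient norm at the iterates
# does not decay to zero.
import numpy as np

rng = np.random.default_rng(0)
d, eta, sigma = 10, 0.1, 1.0          # dimension, step size, gradient-noise scale (illustrative values)
w = rng.normal(size=d)                # initial weights

loss_trace, grad_norm_trace = [], []
for t in range(20_000):
    grad = w                                            # gradient of L(w) = 0.5 * ||w||^2 at the current iterate
    loss_trace.append(0.5 * float(w @ w))
    grad_norm_trace.append(float(np.linalg.norm(grad)))
    w = w - eta * (grad + sigma * rng.normal(size=d))   # SGD step with additive gradient noise

# For this quadratic the stationary expected loss is roughly
# d * eta * sigma**2 / (2 * (2 - eta)) ~ 0.26, while the gradient norm
# fluctuates around a strictly positive value (~0.7 here).
print("mean loss,      last 5k steps:", np.mean(loss_trace[-5000:]))
print("mean grad norm, last 5k steps:", np.mean(grad_norm_trace[-5000:]))

In this sketch the weights never approach the stationary point w = 0; instead their distribution approaches a fixed Gaussian, which is exactly the invariant-measure viewpoint the paper develops for realistic training runs.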

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-zhang22q,
  title     = {Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective},
  author    = {Zhang, Jingzhao and Li, Haochuan and Sra, Suvrit and Jadbabaie, Ali},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {26330--26346},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/zhang22q/zhang22q.pdf},
  url       = {https://proceedings.mlr.press/v162/zhang22q.html}
}
Endnote
%0 Conference Paper
%T Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective
%A Jingzhao Zhang
%A Haochuan Li
%A Suvrit Sra
%A Ali Jadbabaie
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-zhang22q
%I PMLR
%P 26330--26346
%U https://proceedings.mlr.press/v162/zhang22q.html
%V 162
APA
Zhang, J., Li, H., Sra, S. & Jadbabaie, A. (2022). Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:26330-26346. Available from https://proceedings.mlr.press/v162/zhang22q.html.
