How much does Initialization Affect Generalization?

Sameera Ramasinghe, Lachlan Ewen Macdonald, Moshiur Farazi, Hemanth Saratchandran, Simon Lucey
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:28637-28655, 2023.

Abstract

Characterizing the remarkable generalization properties of over-parameterized neural networks remains an open problem. A growing body of recent literature shows that the bias of stochastic gradient descent (SGD) and architecture choice implicitly leads to better generalization. In this paper, we show on the contrary that, independently of architecture, SGD can itself be the cause of poor generalization if one does not ensure good initialization. Specifically, we prove that any differentiably parameterized model, trained under gradient flow, obeys a weak spectral bias law which states that sufficiently high frequencies train arbitrarily slowly. This implies that very high frequencies present at initialization will remain after training, and hamper generalization. Further, we empirically test the developed theoretical insights using practical, deep networks. Finally, we contrast our framework with that supplied by the flat-minima conjecture and show that Fourier analysis grants a more reliable framework for understanding the generalization of neural networks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-ramasinghe23a,
  title     = {How much does Initialization Affect Generalization?},
  author    = {Ramasinghe, Sameera and Macdonald, Lachlan Ewen and Farazi, Moshiur and Saratchandran, Hemanth and Lucey, Simon},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {28637--28655},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/ramasinghe23a/ramasinghe23a.pdf},
  url       = {https://proceedings.mlr.press/v202/ramasinghe23a.html},
  abstract  = {Characterizing the remarkable generalization properties of over-parameterized neural networks remains an open problem. A growing body of recent literature shows that the bias of stochastic gradient descent (SGD) and architecture choice implicitly leads to better generalization. In this paper, we show on the contrary that, independently of architecture, SGD can itself be the cause of poor generalization if one does not ensure good initialization. Specifically, we prove that any differentiably parameterized model, trained under gradient flow, obeys a weak spectral bias law which states that sufficiently high frequencies train arbitrarily slowly. This implies that very high frequencies present at initialization will remain after training, and hamper generalization. Further, we empirically test the developed theoretical insights using practical, deep networks. Finally, we contrast our framework with that supplied by the flat-minima conjecture and show that Fourier analysis grants a more reliable framework for understanding the generalization of neural networks.}
}
Endnote
%0 Conference Paper
%T How much does Initialization Affect Generalization?
%A Sameera Ramasinghe
%A Lachlan Ewen Macdonald
%A Moshiur Farazi
%A Hemanth Saratchandran
%A Simon Lucey
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-ramasinghe23a
%I PMLR
%P 28637--28655
%U https://proceedings.mlr.press/v202/ramasinghe23a.html
%V 202
%X Characterizing the remarkable generalization properties of over-parameterized neural networks remains an open problem. A growing body of recent literature shows that the bias of stochastic gradient descent (SGD) and architecture choice implicitly leads to better generalization. In this paper, we show on the contrary that, independently of architecture, SGD can itself be the cause of poor generalization if one does not ensure good initialization. Specifically, we prove that any differentiably parameterized model, trained under gradient flow, obeys a weak spectral bias law which states that sufficiently high frequencies train arbitrarily slowly. This implies that very high frequencies present at initialization will remain after training, and hamper generalization. Further, we empirically test the developed theoretical insights using practical, deep networks. Finally, we contrast our framework with that supplied by the flat-minima conjecture and show that Fourier analysis grants a more reliable framework for understanding the generalization of neural networks.
APA
Ramasinghe, S., Macdonald, L.E., Farazi, M., Saratchandran, H. & Lucey, S. (2023). How much does Initialization Affect Generalization?. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:28637-28655. Available from https://proceedings.mlr.press/v202/ramasinghe23a.html.