Unique Properties of Flat Minima in Deep Networks

Rotem Mulayoff, Tomer Michaeli
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:7108-7118, 2020.

Abstract

It is well known that (stochastic) gradient descent has an implicit bias towards flat minima. In deep neural network training, this mechanism serves to screen out minima. However, the precise effect that this has on the trained network is not yet fully understood. In this paper, we characterize the flat minima in linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the flattest of all minima. We then prove that these minima correspond to nearly balanced networks, whereby the gain from the input to any intermediate representation does not change drastically from one layer to the next. Finally, we show that consecutive layers in flat minima solutions are coupled. That is, one of the left singular vectors of each weight matrix equals one of the right singular vectors of the next matrix. This forms a distinct path from input to output that, as we show, is dedicated to the signal that experiences the largest gain end-to-end. Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.
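
The two properties described in the abstract lend themselves to a quick numerical check. The sketch below (not code from the paper; the layer widths, step size, iteration count, and synthetic data are arbitrary choices made for illustration) trains a small deep linear network with full-batch gradient descent on a quadratic loss, then probes (1) the gain from the input to each intermediate representation and (2) the alignment between the top left singular vector of each weight matrix and the top right singular vector of the next. Unlike the paper's setting of linear ResNets with zero initialization, this toy uses a plain linear network with small random initialization, so it is not guaranteed to land on the flattest minimum; the probes only show how the properties can be measured.

# Illustrative sketch, not the authors' code: deep *linear* network trained
# with full-batch gradient descent on a quadratic loss, followed by probes
# of layer-wise gains and singular-vector coupling between adjacent layers.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = A x + noise (dimensions chosen arbitrarily).
d_in, d_out, n = 6, 4, 512
A = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))
Y = A @ X + 0.01 * rng.standard_normal((d_out, n))

# Deep linear network f(x) = W3 W2 W1 x; widths and init scale are arbitrary.
widths = [d_in, 8, 8, d_out]
Ws = [rng.standard_normal((widths[i + 1], widths[i])) / np.sqrt(widths[i])
      for i in range(3)]

lr = 1e-2
for _ in range(10000):
    # Forward pass: keep every intermediate representation.
    acts = [X]
    for W in Ws:
        acts.append(W @ acts[-1])
    # Gradient of the quadratic loss (1/2n) ||W3 W2 W1 X - Y||_F^2
    # with respect to the network output.
    grad = (acts[-1] - Y) / n
    for i in reversed(range(3)):
        gW = grad @ acts[i].T        # gradient with respect to W_{i+1}
        grad = Ws[i].T @ grad        # backpropagate to the layer below
        Ws[i] -= lr * gW

# (1) Balancedness probe: largest gain from the input to each layer's output.
prod = np.eye(d_in)
for i, W in enumerate(Ws, start=1):
    prod = W @ prod
    print(f"spectral norm of W{i}...W1: {np.linalg.norm(prod, 2):.3f}")

# (2) Coupling probe: compare the top left singular vector of W_i with the
# top right singular vector of W_{i+1}; |cosine| near 1 indicates alignment.
for i in range(2):
    Ui, _, _ = np.linalg.svd(Ws[i])
    _, _, Vt = np.linalg.svd(Ws[i + 1])
    print(f"|cos| between layers {i+1} and {i+2}: {abs(Ui[:, 0] @ Vt[0]):.3f}")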

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-mulayoff20a,
  title     = {Unique Properties of Flat Minima in Deep Networks},
  author    = {Mulayoff, Rotem and Michaeli, Tomer},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {7108--7118},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/mulayoff20a/mulayoff20a.pdf},
  url       = {https://proceedings.mlr.press/v119/mulayoff20a.html},
  abstract  = {It is well known that (stochastic) gradient descent has an implicit bias towards flat minima. In deep neural network training, this mechanism serves to screen out minima. However, the precise effect that this has on the trained network is not yet fully understood. In this paper, we characterize the flat minima in linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the flattest of all minima. We then prove that these minima correspond to nearly balanced networks whereby the gain from the input to any intermediate representation does not change drastically from one layer to the next. Finally, we show that consecutive layers in flat minima solutions are coupled. That is, one of the left singular vectors of each weight matrix, equals one of the right singular vectors of the next matrix. This forms a distinct path from input to output, that, as we show, is dedicated to the signal that experiences the largest gain end-to-end. Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.}
}
Endnote
%0 Conference Paper
%T Unique Properties of Flat Minima in Deep Networks
%A Rotem Mulayoff
%A Tomer Michaeli
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-mulayoff20a
%I PMLR
%P 7108--7118
%U https://proceedings.mlr.press/v119/mulayoff20a.html
%V 119
%X It is well known that (stochastic) gradient descent has an implicit bias towards flat minima. In deep neural network training, this mechanism serves to screen out minima. However, the precise effect that this has on the trained network is not yet fully understood. In this paper, we characterize the flat minima in linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the flattest of all minima. We then prove that these minima correspond to nearly balanced networks whereby the gain from the input to any intermediate representation does not change drastically from one layer to the next. Finally, we show that consecutive layers in flat minima solutions are coupled. That is, one of the left singular vectors of each weight matrix, equals one of the right singular vectors of the next matrix. This forms a distinct path from input to output, that, as we show, is dedicated to the signal that experiences the largest gain end-to-end. Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.
APA
Mulayoff, R. & Michaeli, T. (2020). Unique Properties of Flat Minima in Deep Networks. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:7108-7118. Available from https://proceedings.mlr.press/v119/mulayoff20a.html.
