Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

Ohad Shamir
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:2691-2713, 2019.

Abstract

We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. We prove that for standard random initializations, and under mild assumptions on $f$, the number of iterations required for convergence scales exponentially with the depth $k$. We also show empirically that this phenomenon can occur in higher dimensions, where each $w_i$ is a matrix. This highlights a potential obstacle in understanding the convergence of gradient-based methods for deep linear neural networks, where $k$ is large.
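To make the setting concrete, below is a minimal numerical sketch (an illustration only, not the paper's experimental protocol): plain gradient descent on $f(\prod_{i=1}^{k} w_i)$ with the illustrative choice $f(x) = (x-1)^2$, scalar weights drawn from a standard normal initialization, and a fixed step size, counting the iterations needed to drive the loss below a tolerance. The specific loss, step size, tolerance, and initialization scale are assumptions made here for the sketch.

# Illustrative sketch (assumed choices: f(x) = (x - 1)^2, fixed step size,
# standard normal initialization) -- not the paper's exact setup.
import numpy as np

def iterations_to_converge(k, lr=1e-3, tol=1e-4, max_iters=2_000_000, seed=0):
    """Run gradient descent on (prod_i w_i - 1)^2 over scalars w_1, ..., w_k."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0, size=k)  # standard random initialization
    for t in range(max_iters):
        p = np.prod(w)
        if (p - 1.0) ** 2 < tol:
            return t
        # d/dw_i (p - 1)^2 = 2 * (p - 1) * prod_{j != i} w_j
        grad = np.array([2.0 * (p - 1.0) * np.prod(np.delete(w, i))
                         for i in range(k)])
        w = w - lr * grad
    return max_iters  # no convergence within the iteration budget

for k in (2, 4, 6, 8):
    print(f"depth k={k}: {iterations_to_converge(k)} iterations")

On typical runs the iteration count grows sharply with the depth k, and for some seeds gradient descent stalls with the product near zero; this is the kind of behavior captured by the paper's exponential-in-depth convergence-time result in the one-dimensional setting.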

Cite this Paper


BibTeX
@InProceedings{pmlr-v99-shamir19a,
  title     = {Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks},
  author    = {Shamir, Ohad},
  booktitle = {Proceedings of the Thirty-Second Conference on Learning Theory},
  pages     = {2691--2713},
  year      = {2019},
  editor    = {Beygelzimer, Alina and Hsu, Daniel},
  volume    = {99},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--28 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v99/shamir19a/shamir19a.pdf},
  url       = {https://proceedings.mlr.press/v99/shamir19a.html}
}
APA
Shamir, O. (2019). Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks. Proceedings of the Thirty-Second Conference on Learning Theory, in Proceedings of Machine Learning Research 99:2691-2713. Available from https://proceedings.mlr.press/v99/shamir19a.html.
