Width Provably Matters in Optimization for Deep Linear Neural Networks

Simon Du, Wei Hu
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:1655-1664, 2019.

Abstract

We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\widetilde{\Omega}\left(L \cdot r \cdot d_{out} \cdot \kappa^3 \right)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp\left(\Omega\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.
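
To make the setting concrete, here is a minimal NumPy sketch (not taken from the paper) of the objective the theorem analyzes, $f(W_1,\dots,W_L) = \frac{1}{2}\left\|W_L \cdots W_1 X - Y\right\|_F^2$, trained by full-batch gradient descent from Gaussian random initialization. The specific depth, width, step size, and $1/\sqrt{\text{fan-in}}$ initialization scale below are illustrative assumptions, not the constants appearing in the theorem.

import numpy as np

rng = np.random.default_rng(0)

# Problem sizes (illustrative; the theorem requires width on the order of L * r * d_out * kappa^3)
L, d_in, d_out, width, n = 10, 20, 5, 500, 100

X = rng.standard_normal((d_in, n))   # input data; full rank with high probability
Y = rng.standard_normal((d_out, n))  # targets

# L-layer linear network: d_in -> width -> ... -> width -> d_out,
# Gaussian initialization with a 1/sqrt(fan-in) scale (an assumption, not the paper's exact scheme)
dims = [d_in] + [width] * (L - 1) + [d_out]
W = [rng.standard_normal((dims[l + 1], dims[l])) / np.sqrt(dims[l]) for l in range(L)]

def loss(W):
    P = X
    for Wl in W:
        P = Wl @ P
    return 0.5 * np.linalg.norm(P - Y) ** 2

# Global minimum of the objective: here no rank bottleneck binds (every width >= min(d_in, d_out)),
# so it equals the unconstrained least-squares fit of Y by a single linear map of X
opt = 0.5 * np.linalg.norm(Y @ np.linalg.pinv(X) @ X - Y) ** 2

eta = 1e-5  # illustrative step size, small enough for stable full-batch gradient descent here
print(f"initial loss {loss(W):.3f}, global minimum {opt:.3f}")
for step in range(500):
    # Forward pass, caching partial products acts[l] = W_l ... W_1 X (with acts[0] = X)
    acts = [X]
    for Wl in W:
        acts.append(Wl @ acts[-1])
    G = acts[-1] - Y  # gradient of the loss with respect to the network output
    # Backward pass: dLoss/dW_l = (W_L ... W_{l+1})^T G (W_{l-1} ... W_1 X)^T
    back = G
    grads = [None] * L
    for l in reversed(range(L)):
        grads[l] = back @ acts[l].T
        back = W[l].T @ back
    for l in range(L):
        W[l] -= eta * grads[l]
print(f"final loss {loss(W):.3f}, global minimum {opt:.3f}")

With these assumed settings, the final loss should come out close to the least-squares optimum after a few hundred iterations, giving a small-scale illustration of (though of course not a proof of) the convergence-to-a-global-minimum behavior the theorem guarantees for sufficiently wide networks.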

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-du19a,
  title     = {Width Provably Matters in Optimization for Deep Linear Neural Networks},
  author    = {Du, Simon and Hu, Wei},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {1655--1664},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/du19a/du19a.pdf},
  url       = {https://proceedings.mlr.press/v97/du19a.html},
  abstract  = {We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\widetilde{\Omega}\left(L \cdot r \cdot d_{out} \cdot \kappa^3 \right)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp\left(\Omega\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.}
}
Endnote
%0 Conference Paper
%T Width Provably Matters in Optimization for Deep Linear Neural Networks
%A Simon Du
%A Wei Hu
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-du19a
%I PMLR
%P 1655--1664
%U https://proceedings.mlr.press/v97/du19a.html
%V 97
%X We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\widetilde{\Omega}\left(L \cdot r \cdot d_{out} \cdot \kappa^3 \right)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp\left(\Omega\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.
APA
Du, S. & Hu, W. (2019). Width Provably Matters in Optimization for Deep Linear Neural Networks. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:1655-1664. Available from https://proceedings.mlr.press/v97/du19a.html.
