Width Provably Matters in Optimization for Deep Linear Neural Networks

Simon Du; Wei Hu

Width Provably Matters in Optimization for Deep Linear Neural Networks

Simon Du, Wei Hu

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:1655-1664, 2019.

Abstract

We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\widetilde{\Omega}\left(L \cdot r \cdot d_{out} \cdot \kappa^3 \right)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp\left(\Omega\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.

Cite this Paper

BibTeX

@InProceedings{pmlr-v97-du19a,
  title = 	 {Width Provably Matters in Optimization for Deep Linear Neural Networks},
  author =       {Du, Simon and Hu, Wei},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {1655--1664},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/du19a/du19a.pdf},
  url = 	 {https://proceedings.mlr.press/v97/du19a.html},
  abstract = 	 {We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\widetilde{\Omega}\left(L \cdot r \cdot d_{out} \cdot \kappa^3 \right)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp\left(\Omega\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.}
}

Endnote

%0 Conference Paper
%T Width Provably Matters in Optimization for Deep Linear Neural Networks
%A Simon Du
%A Wei Hu
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov	
%F pmlr-v97-du19a
%I PMLR
%P 1655--1664
%U https://proceedings.mlr.press/v97/du19a.html
%V 97
%X We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\widetilde{\Omega}\left(L \cdot r \cdot d_{out} \cdot \kappa^3 \right)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp\left(\Omega\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models.

APA

Du, S. & Hu, W.. (2019). Width Provably Matters in Optimization for Deep Linear Neural Networks. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:1655-1664 Available from https://proceedings.mlr.press/v97/du19a.html.

Width Provably Matters in Optimization for Deep Linear Neural Networks

Abstract

Cite this Paper

Related Material