Linear Convergence of Gradient Descent For Finite Width Over-parametrized Linear Networks With General Initialization

Ziqing Xu; Hancheng Min; Salma Tarmoun; Enrique Mallada; Rene Vidal

Linear Convergence of Gradient Descent For Finite Width Over-parametrized Linear Networks With General Initialization

Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, Rene Vidal

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:2262-2284, 2023.

Abstract

Recent theoretical analyses of the convergence of gradient descent (GD) to a global minimum for over-parametrized neural networks make strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (spectral, balanced). In this work, we relax these assumptions and derive a linear convergence rate for two-layer linear networks trained using GD on the squared loss in the case of finite step size, finite width and general initialization. Despite the generality of our analysis, our rate estimates are significantly tighter than those of prior work. Moreover, we provide a time-varying step size rule that monotonically improves the convergence rate as the loss function decreases to zero. Numerical experiments validate our findings.

Cite this Paper

BibTeX


@InProceedings{pmlr-v206-xu23c,
  title = 	 {Linear Convergence of Gradient Descent For Finite Width Over-parametrized Linear Networks With General Initialization},
  author =       {Xu, Ziqing and Min, Hancheng and Tarmoun, Salma and Mallada, Enrique and Vidal, Rene},
  booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {2262--2284},
  year = 	 {2023},
  editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume = 	 {206},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--27 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v206/xu23c/xu23c.pdf},
  url = 	 {https://proceedings.mlr.press/v206/xu23c.html},
  abstract = 	 {Recent theoretical analyses of the convergence of gradient descent (GD) to a global minimum for over-parametrized neural networks make strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (spectral, balanced). In this work, we relax these assumptions and derive a linear convergence rate for two-layer linear networks trained using GD on the squared loss in the case of finite step size, finite width and general initialization. Despite the generality of our analysis, our rate estimates are significantly tighter than those of prior work. Moreover, we provide a time-varying step size rule that monotonically improves the convergence rate as the loss function decreases to zero. Numerical experiments validate our findings.}
}

Endnote

%0 Conference Paper
%T Linear Convergence of Gradient Descent For Finite Width Over-parametrized Linear Networks With General Initialization
%A Ziqing Xu
%A Hancheng Min
%A Salma Tarmoun
%A Enrique Mallada
%A Rene Vidal
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent	
%F pmlr-v206-xu23c
%I PMLR
%P 2262--2284
%U https://proceedings.mlr.press/v206/xu23c.html
%V 206
%X Recent theoretical analyses of the convergence of gradient descent (GD) to a global minimum for over-parametrized neural networks make strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (spectral, balanced). In this work, we relax these assumptions and derive a linear convergence rate for two-layer linear networks trained using GD on the squared loss in the case of finite step size, finite width and general initialization. Despite the generality of our analysis, our rate estimates are significantly tighter than those of prior work. Moreover, we provide a time-varying step size rule that monotonically improves the convergence rate as the loss function decreases to zero. Numerical experiments validate our findings.

APA


Xu, Z., Min, H., Tarmoun, S., Mallada, E. & Vidal, R.. (2023). Linear Convergence of Gradient Descent For Finite Width Over-parametrized Linear Networks With General Initialization. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:2262-2284 Available from https://proceedings.mlr.press/v206/xu23c.html.

Related Material

Download PDF