Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron

Weihang Xu; Simon Du

Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron

Weihang Xu, Simon Du

Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:1155-1198, 2023.

Abstract

We revisit the canonical problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has

$n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a

$O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting (

$n=1$ ) in which the gradient descent enjoys an

$\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an

$\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD’s dynamics. Along the way, we prove gradient descent automatically balances student neurons and uses this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.

Cite this Paper

BibTeX


@InProceedings{pmlr-v195-xu23a,
  title = 	 {Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron},
  author =       {Xu, Weihang and Du, Simon},
  booktitle = 	 {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages = 	 {1155--1198},
  year = 	 {2023},
  editor = 	 {Neu, Gergely and Rosasco, Lorenzo},
  volume = 	 {195},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {12--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v195/xu23a/xu23a.pdf},
  url = 	 {https://proceedings.mlr.press/v195/xu23a.html},
  abstract = 	 {We revisit the canonical problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$) in which the gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a  three-phase structure to analyze GD’s dynamics. Along the way, we prove gradient descent automatically balances student neurons and uses this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.}
}

Endnote

%0 Conference Paper
%T Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
%A Weihang Xu
%A Simon Du
%B Proceedings of Thirty Sixth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Gergely Neu
%E Lorenzo Rosasco	
%F pmlr-v195-xu23a
%I PMLR
%P 1155--1198
%U https://proceedings.mlr.press/v195/xu23a.html
%V 195
%X We revisit the canonical problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$) in which the gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a  three-phase structure to analyze GD’s dynamics. Along the way, we prove gradient descent automatically balances student neurons and uses this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.

APA


Xu, W. & Du, S.. (2023). Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:1155-1198 Available from https://proceedings.mlr.press/v195/xu23a.html.

Related Material

Download PDF