Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:1155-1198, 2023.
Abstract
We revisit the canonical problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We focus in particular on the over-parameterization setting, where the student network has n ≥ 2 neurons. We prove the global convergence of randomly initialized gradient descent with an O(T^{-3}) rate. This is the first global convergence result for this problem beyond the exact-parameterization setting (n = 1), in which gradient descent enjoys an exp(-Ω(T)) rate. Perhaps surprisingly, we further present an Ω(T^{-3}) lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down the convergence rate. To prove global convergence, we need to tackle the interactions among the student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD's dynamics. Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show that this potential function converges slowly, which implies the slow convergence rate of the loss function.
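The following is a minimal sketch (not the paper's experimental setup) of the setting the abstract describes: a teacher neuron f*(x) = ReLU(<w*, x>), an over-parameterized student f(x) = Σ_i ReLU(<w_i, x>) with n ≥ 2 neurons, Gaussian inputs, square loss, and randomly initialized gradient descent. The dimension, width, step size, batch size, and iteration count are illustrative assumptions.

```python
# Sketch of the single-neuron teacher / over-parameterized student setting.
# All hyperparameters below are illustrative choices, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 4                              # input dimension; student width n >= 2 (over-parameterized)
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)          # teacher neuron on the unit sphere

W = 0.1 * rng.normal(size=(n, d))         # random initialization of the student neurons

def relu(z):
    return np.maximum(z, 0.0)

def loss_and_grad(W, X):
    """Empirical square loss (1/2)E[(f(x) - f*(x))^2] and its (sub)gradient,
    approximating the Gaussian population loss with a fresh batch X."""
    y_star = relu(X @ w_star)                       # teacher outputs
    pre = X @ W.T                                   # (batch, n) student pre-activations
    y = relu(pre).sum(axis=1)                       # student outputs
    err = y - y_star
    loss = 0.5 * np.mean(err ** 2)
    # d/dw_i of (1/2)(f(x) - f*(x))^2 = (f(x) - f*(x)) * 1{<w_i, x> > 0} * x
    grad = ((err[:, None] * (pre > 0)).T @ X) / X.shape[0]
    return loss, grad

eta, steps, batch = 0.05, 2000, 4096
for t in range(steps):
    X = rng.normal(size=(batch, d))                 # standard Gaussian inputs
    loss, grad = loss_and_grad(W, X)
    W -= eta * grad
    if t % 500 == 0:
        print(f"step {t:5d}  loss {loss:.6f}")

# The abstract's results say the loss decays at a polynomial Theta(T^{-3}) rate
# in this over-parameterized regime, versus exp(-Omega(T)) when n = 1.
```

In this sketch the population loss is replaced by a fresh-batch empirical estimate for simplicity; the paper's analysis concerns gradient descent on the population loss (and gradient flow for the lower bound).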