Learning Over-Parametrized Two-Layer Neural Networks beyond NTK

Yuanzhi Li; Tengyu Ma; Hongyang R. Zhang

Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:2613-2682, 2020.

Abstract

We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.
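To make the data model in the abstract concrete, here is a minimal sketch of the target function $f^{\star}(x) = a^{\top}|W^{\star}x|$ in plain Python. It is an illustration only, not the authors' code. The helper names (`random_orthonormal`, `teacher`), the use of a standard Gaussian for $x$, the Gram-Schmidt construction of $W^{\star}$, and the way the nonnegative vector $a$ is sampled are all assumptions chosen for the example.

```python
import math
import random


def random_orthonormal(d, rng):
    """Return d orthonormal rows in R^d via Gram-Schmidt on Gaussian vectors.

    Assumption: any orthonormal matrix would do; this is one convenient choice.
    """
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the rows found so far.
        for u in rows:
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-8:  # skip (very unlikely) degenerate draws
            rows.append([vi / norm for vi in v])
    return rows


def teacher(x, a, w_star):
    """Evaluate f*(x) = a^T |W* x| with the absolute value taken entrywise."""
    return sum(
        a_i * abs(sum(w_ij * x_j for w_ij, x_j in zip(row, x)))
        for a_i, row in zip(a, w_star)
    )


rng = random.Random(0)
d = 5  # illustrative dimension
w_star = random_orthonormal(d, rng)
a = [abs(rng.gauss(0.0, 1.0)) for _ in range(d)]  # nonnegative (assumed sampling)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]       # standard Gaussian input (assumed)
y = teacher(x, a, w_star)
```

Because $a$ is nonnegative and the absolute value is even, the label is nonnegative and satisfies $f^{\star}(x) = f^{\star}(-x)$, which the sketch reproduces.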

Cite this Paper

BibTeX


@InProceedings{pmlr-v125-li20a,
  title = 	 {Learning Over-Parametrized Two-Layer Neural Networks beyond NTK},
  author =       {Li, Yuanzhi and Ma, Tengyu and Zhang, Hongyang R.},
  booktitle = 	 {Proceedings of Thirty Third Conference on Learning Theory},
  pages = 	 {2613--2682},
  year = 	 {2020},
  editor = 	 {Abernethy, Jacob and Agarwal, Shivani},
  volume = 	 {125},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--12 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v125/li20a/li20a.pdf},
  url = 	 {https://proceedings.mlr.press/v125/li20a.html},
  abstract = 	 { We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.}
}

Endnote

%0 Conference Paper
%T Learning Over-Parametrized Two-Layer Neural Networks beyond NTK
%A Yuanzhi Li
%A Tengyu Ma
%A Hongyang R. Zhang
%B Proceedings of Thirty Third Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2020
%E Jacob Abernethy
%E Shivani Agarwal
%F pmlr-v125-li20a
%I PMLR
%P 2613--2682
%U https://proceedings.mlr.press/v125/li20a.html
%V 125
%X  We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.

APA


Li, Y., Ma, T. & Zhang, H.R. (2020). Learning Over-Parametrized Two-Layer Neural Networks beyond NTK. Proceedings of Thirty Third Conference on Learning Theory, in Proceedings of Machine Learning Research 125:2613-2682. Available from https://proceedings.mlr.press/v125/li20a.html.
