Learning Over-Parametrized Two-Layer Neural Networks beyond NTK

Yuanzhi Li, Tengyu Ma, Hongyang R. Zhang
Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:2613-2682, 2020.

Abstract

We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two-layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomially many samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a number of samples polynomial in $d$, has population loss at least $\Omega(1/d)$.
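To make the setting concrete, the sketch below generates data from the ground-truth model $f^{\star}(x) = a^{\top}|W^{\star}x|$ with Gaussian inputs, then trains an over-parameterized two-layer ReLU network on the squared loss with full-batch gradient descent from random initialization. All sizes and hyperparameters (d, m, n, lr, steps) are illustrative choices, not taken from the paper, and this toy NumPy experiment does not reproduce the paper's algorithmic details or the parameter regime of its analysis.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes: input dimension d, hidden width m >> d, sample size n.
    d, m, n = 20, 400, 5000
    lr, steps = 1e-3, 3000

    # Ground truth f*(x) = a^T |W* x| with a >= 0 entrywise and W* orthonormal.
    a_star = np.abs(rng.normal(size=d))
    W_star, _ = np.linalg.qr(rng.normal(size=(d, d)))  # QR factor gives an orthonormal matrix

    # Gaussian inputs and their labels.
    X = rng.normal(size=(n, d))
    y = np.abs(X @ W_star.T) @ a_star

    # Over-parameterized two-layer ReLU student f(x) = sum_j b_j * relu(w_j . x), random init.
    W = rng.normal(size=(m, d)) / np.sqrt(d)
    b = rng.normal(size=m) / np.sqrt(m)

    # Full-batch gradient descent on the squared loss, training both layers.
    for t in range(steps):
        H = np.maximum(X @ W.T, 0.0)                     # hidden activations, shape (n, m)
        err = H @ b - y                                  # residuals, shape (n,)
        grad_b = H.T @ err / n
        grad_W = ((err[:, None] * (H > 0) * b).T @ X) / n
        b -= lr * grad_b
        W -= lr * grad_W
        if t % 500 == 0:
            print(f"step {t:4d}  train MSE = {np.mean(err ** 2):.3f}")

    # Held-out MSE as a crude proxy for the population loss studied in the paper.
    X_test = rng.normal(size=(n, d))
    y_test = np.abs(X_test @ W_star.T) @ a_star
    test_err = np.maximum(X_test @ W.T, 0.0) @ b - y_test
    print(f"held-out MSE = {np.mean(test_err ** 2):.3f}")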

Cite this Paper


BibTeX
@InProceedings{pmlr-v125-li20a,
  title     = {Learning Over-Parametrized Two-Layer Neural Networks beyond NTK},
  author    = {Li, Yuanzhi and Ma, Tengyu and Zhang, Hongyang R.},
  booktitle = {Proceedings of Thirty Third Conference on Learning Theory},
  pages     = {2613--2682},
  year      = {2020},
  editor    = {Abernethy, Jacob and Agarwal, Shivani},
  volume    = {125},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v125/li20a/li20a.pdf},
  url       = {https://proceedings.mlr.press/v125/li20a.html}
}
Endnote
%0 Conference Paper
%T Learning Over-Parametrized Two-Layer Neural Networks beyond NTK
%A Yuanzhi Li
%A Tengyu Ma
%A Hongyang R. Zhang
%B Proceedings of Thirty Third Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2020
%E Jacob Abernethy
%E Shivani Agarwal
%F pmlr-v125-li20a
%I PMLR
%P 2613--2682
%U https://proceedings.mlr.press/v125/li20a.html
%V 125
APA
Li, Y., Ma, T. & Zhang, H.R. (2020). Learning Over-Parametrized Two-Layer Neural Networks beyond NTK. Proceedings of Thirty Third Conference on Learning Theory, in Proceedings of Machine Learning Research 125:2613-2682. Available from https://proceedings.mlr.press/v125/li20a.html.
