Learning Over-Parametrized Two-Layer Neural Networks beyond NTK
Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:2613-2682, 2020.
Abstract
We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x \in \mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^\star(x) = a^\top |W^\star x|$, where $a \in \mathbb{R}^d$ is a nonnegative vector and $W^\star \in \mathbb{R}^{d \times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two-layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomially many samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a number of samples polynomial in $d$, has population loss at least $\Omega(1/d)$.
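The teacher–student setup in the abstract is easy to simulate. Below is a minimal NumPy sketch, not the paper's experimental protocol: the width m, learning rate, step count, and sample sizes are illustrative assumptions, and the student trains both layers with plain full-batch gradient descent on the squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (not the paper's choices).
d = 10         # input dimension
m = 200        # student width (over-parameterized: m >> d)
n_train = 5000
n_test = 20000
lr = 0.05
steps = 2000

# Teacher: f_star(x) = a^T |W_star x|, with a >= 0 and W_star orthonormal.
a_star = np.abs(rng.normal(size=d))                # nonnegative second layer
W_star, _ = np.linalg.qr(rng.normal(size=(d, d)))  # orthonormal matrix

def teacher(X):
    # X has shape (n, d); returns labels of shape (n,).
    return np.abs(X @ W_star.T) @ a_star

# Gaussian inputs.
X_train = rng.normal(size=(n_train, d))
y_train = teacher(X_train)
X_test = rng.normal(size=(n_test, d))
y_test = teacher(X_test)

# Student: two-layer ReLU network from random initialization.
W = rng.normal(size=(m, d)) / np.sqrt(d)
b = rng.normal(size=m) / np.sqrt(m)

def student(X, W, b):
    return np.maximum(X @ W.T, 0.0) @ b

def mse(pred, y):
    return 0.5 * np.mean((pred - y) ** 2)

for t in range(steps):
    H = np.maximum(X_train @ W.T, 0.0)   # hidden activations, shape (n, m)
    resid = H @ b - y_train              # residuals, shape (n,)
    # Gradients of the empirical squared loss w.r.t. both layers.
    grad_b = H.T @ resid / n_train
    grad_W = ((resid[:, None] * (H > 0)) * b[None, :]).T @ X_train / n_train
    b -= lr * grad_b
    W -= lr * grad_W
    if t % 500 == 0:
        print(f"step {t:4d}  train loss {mse(H @ b, y_train):.5f}  "
              f"test loss {mse(student(X_test, W, b), y_test):.5f}")
```

The held-out loss on fresh Gaussian samples serves as a proxy for the population loss; note that the absolute-value teacher is itself representable by a ReLU network, since |z| = ReLU(z) + ReLU(-z).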