A Corrective View of Neural Networks: Representation, Memorization and Learning
Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:848-901, 2020.
Abstract
We develop a \emph{corrective mechanism} for neural network approximation: the total available non-linear units are divided into multiple groups; the first group approximates the function under consideration, the second group approximates the error produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together, and so on (a minimal sketch of this loop appears after the abstract). This technique yields several new representation and learning results for neural networks:

1. Two-layer neural networks in the random features (RF) regime can memorize arbitrary labels for $n$ arbitrary points in $\mathbb{R}^d$ with $\tilde{O}(\tfrac{n}{\theta^4})$ ReLUs, where $\theta$ is the minimum distance between two distinct points. This bound is optimal in $n$ up to logarithmic factors.

2. Two-layer neural networks with ReLUs and smoothed ReLUs can represent functions with error at most $\epsilon$ using $O(C(a,d)\epsilon^{-1/(a+1)})$ units for $a \in \mathbb{N}\cup\{0\}$, provided the function has $\Theta(ad)$ bounded derivatives. In certain cases $d$ can be replaced with an effective dimension $q \ll d$. Our results indicate that neural networks with only a single nonlinear layer are surprisingly powerful with regard to representation, and show that, in contrast to what is suggested in recent work, depth is not needed to represent highly smooth functions.

3. Gradient descent on the recombination weights of a two-layer random features network with ReLUs and smoothed ReLUs can learn low-degree polynomials up to squared error $\epsilon$ with $\mathrm{subpoly}(1/\epsilon)$ units. Even though deep networks can approximate these polynomials with $\mathrm{polylog}(1/\epsilon)$ units, existing \emph{learning} bounds for this problem require $\mathrm{poly}(1/\epsilon)$ units. To the best of our knowledge, our results give the first sub-polynomial learning guarantees for this problem.
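The opening sentence describes an iterative residual-fitting loop. Below is a minimal numerical sketch of that idea, not code from the paper: the function and parameter names (`corrective_random_features`, `units_per_group`, the ridge term `reg`) and the Gaussian distribution of the frozen first-layer weights are illustrative assumptions. Each group's recombination weights are fit to the residual left by all earlier groups, while the random first layer stays fixed, as in the RF regime.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def corrective_random_features(X, y, n_groups, units_per_group, reg=1e-6, seed=None):
    """Sketch of the corrective mechanism: each group of fixed random ReLU
    features is fit (ridge regression on the recombination weights only)
    to the residual left by all previous groups. Illustrative, not the
    paper's construction."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    residual = y.astype(float).copy()
    groups = []
    for _ in range(n_groups):
        # Frozen random first-layer weights (random features regime).
        W = rng.standard_normal((d, units_per_group))
        b = rng.standard_normal(units_per_group)
        Phi = relu(X @ W + b)
        # Fit only the outer (recombination) weights to the current residual.
        c = np.linalg.solve(
            Phi.T @ Phi + reg * np.eye(units_per_group),
            Phi.T @ residual,
        )
        residual = residual - Phi @ c  # the next group corrects what is left
        groups.append((W, b, c))
    return groups

def predict(groups, X):
    # The combined model is the sum of all group outputs.
    return sum(relu(X @ W + b) @ c for W, b, c in groups)
```

A toy use, loosely in the spirit of result 1 (memorizing arbitrary labels on $n$ points; sizes are arbitrary):

```python
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = rng.choice([-1.0, 1.0], size=200)
groups = corrective_random_features(X, y, n_groups=5, units_per_group=100, seed=1)
print(np.abs(predict(groups, X) - y).max())  # shrinks as groups are added
```

Result 3 concerns gradient descent rather than a closed-form solve on the recombination weights. Under the same assumptions, the solve above could be replaced by a plain gradient loop; the step size and iteration count here are illustrative:

```python
def gd_on_recombination(Phi, y, lr=1e-2, steps=2000):
    """Gradient descent on the outer weights c only, first layer frozen.
    Uses the gradient of 0.5 * mean squared error."""
    c = np.zeros(Phi.shape[1])
    for _ in range(steps):
        c -= lr * Phi.T @ (Phi @ c - y) / len(y)
    return c
```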