A Convergence Theory for Deep Learning via Over-Parameterization
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:242-252, 2019.
Abstract
Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice have been growing wider and deeper. On the theoretical side, a long line of work has focused on why we can train neural networks when there is only one hidden layer; the theory of multi-layer networks remains unsettled. In this work, we prove that simple algorithms such as stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We make only two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in L, the number of DNN layers, and in n, the number of training samples. As concrete examples, starting from randomly initialized weights, we show that SGD attains 100% training accuracy in classification tasks, or minimizes regression loss at a linear convergence rate (ε ∝ e^{-Ω(T)} after T iterations), with running time polynomial in n and L. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss function. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNNs), and residual neural networks (ResNets).
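The regression result in the abstract can be illustrated empirically on a toy instance: a wide two-layer ReLU network, random Gaussian initialization, unit-norm (non-degenerate) inputs, and plain gradient descent on the squared loss. This is a minimal sketch, not the paper's construction; the width, learning rate, and step count below are illustrative choices, far from the paper's exact polynomial bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the abstract's setting (all constants are illustrative):
n, d, m = 10, 10, 4096        # samples, input dim, hidden width (m >> n: over-parameterized)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs (non-degenerate data)
y = rng.standard_normal(n)                      # arbitrary regression targets

W = rng.standard_normal((m, d))                 # random Gaussian initialization
a = rng.choice([-1.0, 1.0], size=m)             # fixed random output layer

def forward(W):
    """Two-layer ReLU network: f(x) = sqrt(2/m) * a . relu(W x)."""
    return np.maximum(X @ W.T, 0.0) @ a * np.sqrt(2.0 / m)

lr, T = 0.1, 1000
losses = []
for _ in range(T):
    err = forward(W) - y                        # residuals, shape (n,)
    losses.append(0.5 * np.sum(err ** 2))       # squared regression loss
    mask = (X @ W.T > 0.0).astype(float)        # ReLU activation pattern, shape (n, m)
    # Gradient of the loss w.r.t. W (ReLU subgradient at 0 taken as 0)
    grad = np.sqrt(2.0 / m) * ((err[:, None] * mask) * a[None, :]).T @ X
    W -= lr * grad

print(f"initial loss {losses[0]:.3f}, final loss {losses[-1]:.2e}")
```

On this over-parameterized instance the recorded losses shrink roughly geometrically in the iteration count, the "linear convergence speed" the abstract refers to; with a much smaller width m, training can instead stall at a non-zero loss.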