Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent
Proceedings of the 31st Conference On Learning Theory, PMLR 75:1042-1085, 2018.
Abstract
Nesterov’s accelerated gradient descent (AGD), an instance of the general family of “momentum methods,” provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. While these methods are widely used in modern \emph{nonconvex} applications, including the training of deep neural networks, whether they are provably superior to GD in the nonconvex setting remains open. This paper studies a simple variant of Nesterov’s AGD and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, matching the best known convergence rate and improving on the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first direct-acceleration (single-loop) algorithm that is provably faster than GD in the general nonconvex setting; all previous nonconvex accelerated algorithms rely on more complex mechanisms such as nested loops and proximal terms. Our analysis rests on two key ideas: (1) a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases at each step even for nonconvex functions, and (2) a novel framework called \emph{improve or localize}, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
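As a rough illustration (the step size $\eta$, momentum parameter $\theta$, and velocity $v_t := x_t - x_{t-1}$ are introduced here for exposition and are not taken from the abstract), the standard Nesterov recursion and a Hamiltonian of the kind alluded to above take the form
\[
y_t = x_t + (1-\theta)\,v_t, \qquad x_{t+1} = y_t - \eta\,\nabla f(y_t), \qquad E_t = f(x_t) + \frac{1}{2\eta}\,\|v_t\|^2 ,
\]
where $E_t$ trades off function value against a kinetic-energy term; this is only a sketch of the generic potential, and the precise variant of the iteration and of the Hamiltonian analyzed in the paper is specified there.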