Escaping Saddle Points with Adaptive Gradient Methods

Matthew Staib; Sashank Reddi; Satyen Kale; Sanjiv Kumar; Suvrit Sra

Escaping Saddle Points with Adaptive Gradient Methods

Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5956-5965, 2019.

Abstract

Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.

Cite this Paper

BibTeX

@InProceedings{pmlr-v97-staib19a,
  title = 	 {Escaping Saddle Points with Adaptive Gradient Methods},
  author =       {Staib, Matthew and Reddi, Sashank and Kale, Satyen and Kumar, Sanjiv and Sra, Suvrit},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {5956--5965},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/staib19a/staib19a.pdf},
  url = 	 {https://proceedings.mlr.press/v97/staib19a.html},
  abstract = 	 {Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.}
}

Endnote

%0 Conference Paper
%T Escaping Saddle Points with Adaptive Gradient Methods
%A Matthew Staib
%A Sashank Reddi
%A Satyen Kale
%A Sanjiv Kumar
%A Suvrit Sra
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov	
%F pmlr-v97-staib19a
%I PMLR
%P 5956--5965
%U https://proceedings.mlr.press/v97/staib19a.html
%V 97
%X Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.

APA

Staib, M., Reddi, S., Kale, S., Kumar, S. & Sra, S.. (2019). Escaping Saddle Points with Adaptive Gradient Methods. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:5956-5965 Available from https://proceedings.mlr.press/v97/staib19a.html.

Escaping Saddle Points with Adaptive Gradient Methods

Abstract

Cite this Paper

Related Material