AdaGrad Avoids Saddle Points

Kimon Antonakopoulos, Panayotis Mertikopoulos, Georgios Piliouras, Xiao Wang
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:731-771, 2022.

Abstract

Adaptive first-order methods in optimization have widespread ML applications due to their ability to adapt to non-convex landscapes. However, their convergence guarantees are typically stated in terms of vanishing gradient norms, which leaves open the issue of converging to undesirable saddle points (or even local maxima). In this paper, we focus on the AdaGrad family of algorithms, from scalar to full-matrix preconditioning, and we examine the question of whether the method’s trajectories avoid saddle points. A major challenge that arises here is that AdaGrad’s step-size (or, more accurately, the method’s preconditioner) evolves over time in a filtration-dependent way, i.e., as a function of all gradients observed in earlier iterations; as a result, avoidance results for methods with a constant or vanishing step-size do not apply. We resolve this challenge by combining a series of step-size stabilization arguments with a recursive representation of the AdaGrad preconditioner that allows us to employ center-stable techniques and ultimately show that the induced trajectories avoid saddle points from almost any initial condition.
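For concreteness, the sketch below (plain NumPy, not the authors' code; function names and defaults are illustrative) spells out the scalar, diagonal, and full-matrix AdaGrad updates the abstract refers to. In every variant the effective step-size or preconditioner at iteration t is built from all gradients observed up to that point, which is the filtration dependence that rules out standard avoidance arguments for constant or vanishing step-sizes.

```python
import numpy as np

def adagrad_scalar(grad, x0, eta=0.1, eps=1e-8, n_steps=1000):
    """Scalar (norm-based) AdaGrad: one adaptive step-size for all coordinates."""
    x, s = np.asarray(x0, dtype=float).copy(), 0.0
    for _ in range(n_steps):
        g = grad(x)
        s += g @ g                              # running sum of squared gradient norms
        x = x - eta * g / (np.sqrt(s) + eps)    # step-size shrinks with the history
    return x

def adagrad_diagonal(grad, x0, eta=0.1, eps=1e-8, n_steps=1000):
    """Diagonal AdaGrad: one adaptive step-size per coordinate."""
    x = np.asarray(x0, dtype=float).copy()
    s = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad(x)
        s += g * g                              # per-coordinate squared-gradient sums
        x = x - eta * g / (np.sqrt(s) + eps)
    return x

def adagrad_full_matrix(grad, x0, eta=0.1, eps=1e-8, n_steps=1000):
    """Full-matrix AdaGrad: precondition with the inverse square root of the
    accumulated outer products of past gradients."""
    x = np.asarray(x0, dtype=float).copy()
    G = np.zeros((x.size, x.size))
    for _ in range(n_steps):
        g = grad(x)
        G += np.outer(g, g)                     # accumulate gradient outer products
        w, V = np.linalg.eigh(G + eps * np.eye(x.size))
        x = x - eta * V @ ((V.T @ g) / np.sqrt(w))  # apply G^{-1/2} to the gradient
    return x

if __name__ == "__main__":
    # Toy saddle: f(x, y) = x^2 - y^2 has a saddle point at the origin.
    def grad_f(z):
        return np.array([2.0 * z[0], -2.0 * z[1]])
    # A generic initialization (second coordinate nonzero) drifts away from the saddle.
    print(adagrad_diagonal(grad_f, [1.0, 1e-3], n_steps=2000))
```

On this toy problem the iterates move away from the origin from almost any starting point (only initializations with the second coordinate exactly zero stay attracted to the saddle), which is the qualitative behavior the paper establishes rigorously for the whole AdaGrad family.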

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-antonakopoulos22a,
  title     = {{A}da{G}rad Avoids Saddle Points},
  author    = {Antonakopoulos, Kimon and Mertikopoulos, Panayotis and Piliouras, Georgios and Wang, Xiao},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {731--771},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/antonakopoulos22a/antonakopoulos22a.pdf},
  url       = {https://proceedings.mlr.press/v162/antonakopoulos22a.html}
}
Endnote
%0 Conference Paper
%T AdaGrad Avoids Saddle Points
%A Kimon Antonakopoulos
%A Panayotis Mertikopoulos
%A Georgios Piliouras
%A Xiao Wang
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-antonakopoulos22a
%I PMLR
%P 731--771
%U https://proceedings.mlr.press/v162/antonakopoulos22a.html
%V 162
APA
Antonakopoulos, K., Mertikopoulos, P., Piliouras, G. & Wang, X. (2022). AdaGrad Avoids Saddle Points. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:731-771. Available from https://proceedings.mlr.press/v162/antonakopoulos22a.html.
