An Alternative View: When Does SGD Escape Local Minima?

Bobby Kleinberg, Yuanzhi Li, Yang Yuan
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2698-2707, 2018.

Abstract

Stochastic gradient descent (SGD) is widely used in machine learning. Although it is commonly viewed as a fast but less accurate version of gradient descent (GD), it consistently finds better solutions than GD for modern neural networks. To understand this phenomenon, we take an alternative view: SGD works on a convolved (and thus smoothed) version of the loss function. We show that, even if the function $f$ has many bad local minima or saddle points, as long as for every point $x$ the weighted average of the gradients in its neighborhood is one-point convex with respect to the desired solution $x^*$, SGD will get close to, and then stay around, $x^*$ with constant probability. Our result identifies a set of functions on which SGD provably works that is much larger than the set of convex functions. Empirically, we observe that the loss surface of neural networks enjoys nice one-point convexity properties locally; our theorem therefore helps explain why SGD works so well for neural networks.
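To make the "convolved view" in the abstract concrete, here is a brief sketch of the setup in paraphrased notation (the symbols $\eta$, $\xi$, $\tilde{f}$, and $c$ below are ours, not necessarily the paper's): an SGD step with step size $\eta$ and gradient noise $\xi$ behaves, in expectation, like a gradient step on the smoothed objective

$$\tilde{f}(x) = \mathbb{E}_{\xi}\big[f(x - \eta\,\xi)\big],$$

and the condition that the weighted average of the gradients in the neighborhood of $x$ is one-point convex with respect to $x^*$ can be read, roughly, as

$$\big\langle -\nabla \tilde{f}(x),\; x^* - x \big\rangle \;\ge\; c\,\|x^* - x\|^2 \quad \text{for some } c > 0 \text{ and all relevant } x.$$

Under such a condition, each SGD step contracts toward $x^*$ on average, even when $f$ itself has many bad local minima or saddle points.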

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-kleinberg18a,
  title     = {An Alternative View: When Does {SGD} Escape Local Minima?},
  author    = {Kleinberg, Bobby and Li, Yuanzhi and Yuan, Yang},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {2698--2707},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/kleinberg18a/kleinberg18a.pdf},
  url       = {https://proceedings.mlr.press/v80/kleinberg18a.html}
}
Endnote
%0 Conference Paper
%T An Alternative View: When Does SGD Escape Local Minima?
%A Bobby Kleinberg
%A Yuanzhi Li
%A Yang Yuan
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-kleinberg18a
%I PMLR
%P 2698--2707
%U https://proceedings.mlr.press/v80/kleinberg18a.html
%V 80
APA
Kleinberg, B., Li, Y. & Yuan, Y. (2018). An Alternative View: When Does SGD Escape Local Minima?. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:2698-2707. Available from https://proceedings.mlr.press/v80/kleinberg18a.html.