Tight analyses for non-smooth stochastic gradient descent

Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, Sikander Randhawa
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:1579-1613, 2019.

Abstract

Consider the problem of minimizing functions that are Lipschitz and strongly convex, but not necessarily differentiable. We prove that after $T$ steps of stochastic gradient descent, the error of the final iterate is $O(\log(T)/T)$ with high probability. We also construct a function from this class for which the error of the final iterate of deterministic gradient descent is $\Omega(\log(T)/T)$. This shows that the upper bound is tight and that, in this setting, the last iterate of stochastic gradient descent has the same general error rate (with high probability) as deterministic gradient descent. This resolves both open questions posed by Shamir (2012). An intermediate step of our analysis proves that the suffix averaging method achieves error $O(1/T)$ with high probability, which is optimal (for any first-order optimization method). This improves results of Rakhlin et al. (2012) and Hazan and Kale (2014), both of which achieved error $O(1/T)$, but only in expectation, and achieved a high probability error bound of $O(\log \log(T)/T)$, which is suboptimal.
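
To make the algorithmic setting concrete, below is a minimal Python sketch (not from the paper; the objective, dimension, noise model, and feasible ball are illustrative assumptions) of projected stochastic subgradient descent with the standard step size $1/(\mu t)$ on a strongly convex, non-smooth function. It compares the error of the final iterate with that of the suffix average over the last $T/2$ iterates.

    import numpy as np

    # Illustrative sketch: projected stochastic subgradient descent on the
    # non-smooth, strongly convex function f(x) = (mu/2)*||x||^2 + ||x||_1,
    # constrained to the unit ball (so f is Lipschitz on the feasible set).
    # The function, noise model, and parameters below are assumptions made
    # for illustration only.

    rng = np.random.default_rng(0)

    mu = 1.0          # strong convexity parameter (assumed)
    d = 20            # dimension (assumed)
    T = 10_000        # number of SGD steps
    noise_std = 1.0   # std of additive noise on the subgradient (assumed)

    def f(x):
        return 0.5 * mu * np.dot(x, x) + np.abs(x).sum()

    def subgradient(x):
        # A valid subgradient of f at x: mu*x + sign(x).
        return mu * x + np.sign(x)

    def project_ball(x, radius=1.0):
        # Euclidean projection onto the ball of the given radius.
        norm = np.linalg.norm(x)
        return x if norm <= radius else x * (radius / norm)

    x = project_ball(rng.normal(size=d))   # initial point in the feasible set
    suffix_sum = np.zeros(d)               # running sum over iterates with t > T/2
    suffix_count = 0

    for t in range(1, T + 1):
        g = subgradient(x) + noise_std * rng.normal(size=d)  # stochastic subgradient
        eta = 1.0 / (mu * t)                                 # step size 1/(mu*t)
        x = project_ball(x - eta * g)
        if t > T // 2:                                       # suffix averaging over the last half
            suffix_sum += x
            suffix_count += 1

    x_suffix = suffix_sum / suffix_count
    f_star = 0.0  # this particular f is minimized at x = 0

    print(f"final iterate error:  {f(x) - f_star:.4e}")
    print(f"suffix average error: {f(x_suffix) - f_star:.4e}")

On runs of this kind, the suffix average typically attains a smaller error than the final iterate, which is consistent with the $O(1/T)$ versus $O(\log(T)/T)$ rates discussed in the abstract.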

Cite this Paper


BibTeX
@InProceedings{pmlr-v99-harvey19a,
  title     = {Tight analyses for non-smooth stochastic gradient descent},
  author    = {Harvey, Nicholas J.~A. and Liaw, Christopher and Plan, Yaniv and Randhawa, Sikander},
  booktitle = {Proceedings of the Thirty-Second Conference on Learning Theory},
  pages     = {1579--1613},
  year      = {2019},
  editor    = {Beygelzimer, Alina and Hsu, Daniel},
  volume    = {99},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--28 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v99/harvey19a/harvey19a.pdf},
  url       = {https://proceedings.mlr.press/v99/harvey19a.html}
}
Endnote
%0 Conference Paper
%T Tight analyses for non-smooth stochastic gradient descent
%A Nicholas J. A. Harvey
%A Christopher Liaw
%A Yaniv Plan
%A Sikander Randhawa
%B Proceedings of the Thirty-Second Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2019
%E Alina Beygelzimer
%E Daniel Hsu
%F pmlr-v99-harvey19a
%I PMLR
%P 1579--1613
%U https://proceedings.mlr.press/v99/harvey19a.html
%V 99
APA
Harvey, N. J. A., Liaw, C., Plan, Y. & Randhawa, S. (2019). Tight analyses for non-smooth stochastic gradient descent. Proceedings of the Thirty-Second Conference on Learning Theory, in Proceedings of Machine Learning Research 99:1579-1613. Available from https://proceedings.mlr.press/v99/harvey19a.html.