The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:313-355, 2022.

Abstract

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order-optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non-adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
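For reference, below is a minimal sketch of the AdaGrad-Norm update as it is commonly stated: a single scalar step size $\eta/b_t$ that is self-tuned from the running sum of squared stochastic-gradient norms, with no knowledge of the smoothness or noise parameters. The step-size constant, the initial accumulator value, and the noisy quadratic test oracle below are illustrative choices, not taken from the paper.

```python
import numpy as np

def adagrad_norm(grad_oracle, x0, eta=1.0, b0=1e-3, T=1000):
    """AdaGrad-Norm sketch: one scalar step size eta / b_t, where
    b_t^2 accumulates the squared norms of observed stochastic gradients.
    This variant folds the current gradient into b_t before stepping,
    so the step size and the gradient are not conditionally independent."""
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2
    for _ in range(T):
        g = grad_oracle(x)                 # stochastic gradient at the current iterate
        b_sq += np.dot(g, g)               # accumulate ||g_t||^2
        x -= (eta / np.sqrt(b_sq)) * g     # self-tuned step: no problem-dependent tuning
    return x

# Illustrative use on a quadratic with noise whose scale grows with the
# gradient norm (in the spirit of affine variance; purely for demonstration):
rng = np.random.default_rng(0)
oracle = lambda x: x + 0.1 * (1.0 + np.linalg.norm(x)) * rng.standard_normal(x.shape)
x_final = adagrad_norm(oracle, x0=np.ones(10), T=5000)
```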

Cite this Paper


BibTeX
@InProceedings{pmlr-v178-faw22a,
  title     = {The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance},
  author    = {Faw, Matthew and Tziotis, Isidoros and Caramanis, Constantine and Mokhtari, Aryan and Shakkottai, Sanjay and Ward, Rachel},
  booktitle = {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages     = {313--355},
  year      = {2022},
  editor    = {Loh, Po-Ling and Raginsky, Maxim},
  volume    = {178},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--05 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v178/faw22a/faw22a.pdf},
  url       = {https://proceedings.mlr.press/v178/faw22a.html},
  abstract  = {We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.}
}
Endnote
%0 Conference Paper
%T The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance
%A Matthew Faw
%A Isidoros Tziotis
%A Constantine Caramanis
%A Aryan Mokhtari
%A Sanjay Shakkottai
%A Rachel Ward
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-faw22a
%I PMLR
%P 313--355
%U https://proceedings.mlr.press/v178/faw22a.html
%V 178
%X We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
APA
Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S. & Ward, R. (2022). The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:313-355. Available from https://proceedings.mlr.press/v178/faw22a.html.
