Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H Sayed
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:337-370, 2026.

Abstract

The study of tail behaviour of \textbf{\texttt{SGD}}-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of \textbf{\texttt{SGD}}-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) \textbf{\texttt{SGD}}, for non-convex costs and bounded noise, with long-term decay at rate $e^{-\frac{t}{\log(t)}}$. Next, we relax the noise assumption by considering clipped \textbf{\texttt{SGD}} (\textbf{\texttt{c-SGD}}) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-\frac{t^{\beta_p}}{\log(t)}}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-\frac{t}{\log^2(t)}}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}} are tight, up to poly-logarithmic factors. Notably, our results demonstrate \textit{an order of magnitude faster} long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}}, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-armacki26a, title = {Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization}, author = {Armacki, Aleksandar and Bajovi\'{c}, Dragana and Jakoveti\'{c}, Du\v{s}an and Kar, Soummya and Sayed, Ali H}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {337--370}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/armacki26a/armacki26a.pdf}, url = {https://proceedings.mlr.press/v336/armacki26a.html}, abstract = {The study of tail behaviour of \textbf{\texttt{SGD}}-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of \textbf{\texttt{SGD}}-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) \textbf{\texttt{SGD}}, for non-convex costs and bounded noise, with long-term decay at rate $e^{-\frac{t}{\log(t)}}$. Next, we relax the noise assumption by considering clipped \textbf{\texttt{SGD}} (\textbf{\texttt{c-SGD}}) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-\frac{t^{\beta_p}}{\log(t)}}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-\frac{t}{\log^2(t)}}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}} are tight, up to poly-logarithmic factors. Notably, our results demonstrate \textit{an order of magnitude faster} long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}}, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.} }
Endnote
%0 Conference Paper %T Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization %A Aleksandar Armacki %A Dragana Bajović %A Dušan Jakovetić %A Soummya Kar %A Ali H Sayed %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-armacki26a %I PMLR %P 337--370 %U https://proceedings.mlr.press/v336/armacki26a.html %V 336 %X The study of tail behaviour of \textbf{\texttt{SGD}}-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of \textbf{\texttt{SGD}}-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) \textbf{\texttt{SGD}}, for non-convex costs and bounded noise, with long-term decay at rate $e^{-\frac{t}{\log(t)}}$. Next, we relax the noise assumption by considering clipped \textbf{\texttt{SGD}} (\textbf{\texttt{c-SGD}}) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-\frac{t^{\beta_p}}{\log(t)}}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-\frac{t}{\log^2(t)}}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}} are tight, up to poly-logarithmic factors. Notably, our results demonstrate \textit{an order of magnitude faster} long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}}, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
APA
Armacki, A., Bajović, D., Jakovetić, D., Kar, S. & Sayed, A.H.. (2026). Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:337-370 Available from https://proceedings.mlr.press/v336/armacki26a.html.

Related Material