Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

Aleksandar Armacki; Dragana Bajović; Dušan Jakovetić; Soummya Kar; Ali H Sayed

Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H Sayed

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:337-370, 2026.

Abstract

The study of tail behaviour of \textbf{\texttt{SGD}}-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of \textbf{\texttt{SGD}}-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) \textbf{\texttt{SGD}}, for non-convex costs and bounded noise, with long-term decay at rate $e^{-\frac{t}{\log(t)}}$. Next, we relax the noise assumption by considering clipped \textbf{\texttt{SGD}} (\textbf{\texttt{c-SGD}}) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-\frac{t^{\beta_p}}{\log(t)}}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-\frac{t}{\log^2(t)}}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}} are tight, up to poly-logarithmic factors. Notably, our results demonstrate \textit{an order of magnitude faster} long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}}, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-armacki26a,
  title = 	 {Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization},
  author =       {Armacki, Aleksandar and Bajovi\'{c}, Dragana and Jakoveti\'{c}, Du\v{s}an and Kar, Soummya and Sayed, Ali H},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {337--370},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/armacki26a/armacki26a.pdf},
  url = 	 {https://proceedings.mlr.press/v336/armacki26a.html},
  abstract = 	 {The study of tail behaviour of \textbf{\texttt{SGD}}-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of \textbf{\texttt{SGD}}-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) \textbf{\texttt{SGD}}, for non-convex costs and bounded noise, with long-term decay at rate $e^{-\frac{t}{\log(t)}}$. Next, we relax the noise assumption by considering clipped \textbf{\texttt{SGD}} (\textbf{\texttt{c-SGD}}) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-\frac{t^{\beta_p}}{\log(t)}}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-\frac{t}{\log^2(t)}}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}} are tight, up to poly-logarithmic factors. Notably, our results demonstrate \textit{an order of magnitude faster} long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}}, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.}
}

Endnote

%0 Conference Paper
%T Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
%A Aleksandar Armacki
%A Dragana Bajović
%A Dušan Jakovetić
%A Soummya Kar
%A Ali H Sayed
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-armacki26a
%I PMLR
%P 337--370
%U https://proceedings.mlr.press/v336/armacki26a.html
%V 336
%X The study of tail behaviour of \textbf{\texttt{SGD}}-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of \textbf{\texttt{SGD}}-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) \textbf{\texttt{SGD}}, for non-convex costs and bounded noise, with long-term decay at rate $e^{-\frac{t}{\log(t)}}$. Next, we relax the noise assumption by considering clipped \textbf{\texttt{SGD}} (\textbf{\texttt{c-SGD}}) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-\frac{t^{\beta_p}}{\log(t)}}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-\frac{t}{\log^2(t)}}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}} are tight, up to poly-logarithmic factors. Notably, our results demonstrate \textit{an order of magnitude faster} long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for \textbf{\texttt{SGD}} and \textbf{\texttt{c-SGD}}, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.

APA

Armacki, A., Bajović, D., Jakovetić, D., Kar, S. & Sayed, A.H.. (2026). Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:337-370 Available from https://proceedings.mlr.press/v336/armacki26a.html.

Related Material

Download PDF