Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

Savelii Chezhegov, Klyukin Yaroslav, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10269-10333, 2025.

Abstract

Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. For the latter, the noise in the stochastic gradients is typically heavy-tailed. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue: we derive new high-probability convergence bounds, with polylogarithmic dependence on the confidence level, for AdaGrad-Norm and Adam-Norm with clipping, with and without delay, for smooth convex and non-convex stochastic optimization under heavy-tailed noise. We also extend our results to AdaGrad/Adam with delayed stepsizes. Our empirical evaluations highlight the superiority of the clipped versions of AdaGrad/Adam in handling heavy-tailed noise.
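To make the algorithmic idea concrete, below is a minimal sketch of a clipped AdaGrad-Norm loop with an optional delayed stepsize. The function name `clipped_adagrad_norm`, its default constants, the `delayed` flag, and the toy Student-t noise oracle are illustrative assumptions for this page, not the authors' exact pseudocode or the stepsize constants from the paper's theorems.

```python
import numpy as np

def clip(g, lam):
    """Clip a gradient vector to Euclidean norm at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

def clipped_adagrad_norm(grad_oracle, x0, n_steps, gamma=1.0, lam=1.0, b0=1.0, delayed=False):
    """Illustrative Clip-AdaGrad-Norm loop (names and constants are hypothetical).

    grad_oracle(x) should return a stochastic gradient at x. With delayed=True,
    the stepsize uses the accumulator from before the current gradient is added,
    mimicking the delayed-stepsize variant discussed in the paper.
    """
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2                                   # running sum of squared clipped-gradient norms
    for _ in range(n_steps):
        g = clip(grad_oracle(x), lam)                # gradient clipping before the adaptive step
        g_sq = np.linalg.norm(g) ** 2
        denom = np.sqrt(b_sq) if delayed else np.sqrt(b_sq + g_sq)
        x = x - (gamma / denom) * g                  # single scalar (norm-based) adaptive stepsize
        b_sq += g_sq                                 # accumulator update
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2 with heavy-tailed (Student-t) noise.
rng = np.random.default_rng(0)
oracle = lambda x: x + rng.standard_t(df=2, size=x.shape)
x_out = clipped_adagrad_norm(oracle, x0=np.ones(10), n_steps=1000, gamma=1.0, lam=1.0)
```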

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chezhegov25a,
  title     = {Clipping Improves {A}dam-Norm and {A}da{G}rad-Norm when the Noise Is Heavy-Tailed},
  author    = {Chezhegov, Savelii and Yaroslav, Klyukin and Semenov, Andrei and Beznosikov, Aleksandr and Gasnikov, Alexander and Horv\'{a}th, Samuel and Tak\'{a}\v{c}, Martin and Gorbunov, Eduard},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {10269--10333},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chezhegov25a/chezhegov25a.pdf},
  url       = {https://proceedings.mlr.press/v267/chezhegov25a.html},
  abstract  = {Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the later ones. Gradient clipping provably helps to achieve good high-probability convergence for such noises. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. We extend our results to the case of AdaGrad/Adam with delayed stepsizes. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise.}
}
Endnote
%0 Conference Paper
%T Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed
%A Savelii Chezhegov
%A Klyukin Yaroslav
%A Andrei Semenov
%A Aleksandr Beznosikov
%A Alexander Gasnikov
%A Samuel Horváth
%A Martin Takáč
%A Eduard Gorbunov
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chezhegov25a
%I PMLR
%P 10269--10333
%U https://proceedings.mlr.press/v267/chezhegov25a.html
%V 267
%X Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the later ones. Gradient clipping provably helps to achieve good high-probability convergence for such noises. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. We extend our results to the case of AdaGrad/Adam with delayed stepsizes. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise.
APA
Chezhegov, S., Yaroslav, K., Semenov, A., Beznosikov, A., Gasnikov, A., Horváth, S., Takáč, M. & Gorbunov, E. (2025). Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:10269-10333. Available from https://proceedings.mlr.press/v267/chezhegov25a.html.