High-Probability Bound for Non-Smooth Non-Convex Stochastic Optimization with Heavy Tails

Langqi Liu, Yibo Wang, Lijun Zhang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:32122-32138, 2024.

Abstract

Recently, Cutkosky et al. introduced the online-to-non-convex framework, which utilizes online learning methods to solve non-smooth non-convex optimization problems and achieves an $\mathcal{O}(\epsilon^{-3}\delta^{-1})$ gradient complexity for finding $(\delta,\epsilon)$-stationary points. However, their results rely on the assumption that stochastic gradients have bounded variance and hold only in expectation. To address these limitations, we investigate the case where stochastic gradients obey heavy-tailed distributions with finite $\mathfrak{p}$-th moments for some $\mathfrak{p}\in(1,2]$, and propose a novel algorithm that identifies a $(\delta,\epsilon)$-stationary point with high probability after consuming $\tilde{\mathcal{O}}(\epsilon^{-\frac{2\mathfrak{p}-1}{\mathfrak{p}-1}}\delta^{-1})$ stochastic gradients. The key idea is to first incorporate the gradient clipping technique into the online-to-non-convex framework to produce a sequence of points whose averaged gradient norm is no greater than $\epsilon$. Then, we propose a validation method to select one $(\delta,\epsilon)$-stationary point from among the candidates. When gradient distributions have bounded variance, i.e., $\mathfrak{p}=2$, our result reduces to $\tilde{\mathcal{O}}(\epsilon^{-3}\delta^{-1})$, which improves the existing $\tilde{\mathcal{O}}(\epsilon^{-4}\delta^{-1})$ high-probability bound. When the objective is smooth, our algorithm can also find an $\epsilon$-stationary point with $\tilde{\mathcal{O}}(\epsilon^{-\frac{3\mathfrak{p}-2}{\mathfrak{p}-1}})$ gradient queries.
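For intuition, the sketch below illustrates the gradient-clipping operation the abstract refers to: a stochastic gradient whose norm exceeds a threshold is rescaled onto the ball of that radius, which bounds the influence of heavy-tailed noise before the gradient is fed to the online learner. This is only a minimal sketch of the standard clipping-by-norm step; the threshold name tau, the helper clip_gradient, and the synthetic heavy-tailed noise are illustrative assumptions, not the paper's actual algorithm or parameter schedule.

import numpy as np

def clip_gradient(g, tau):
    """Rescale the stochastic gradient g so its Euclidean norm is at most tau.

    tau is a hypothetical clipping threshold, not the schedule used in the paper.
    """
    norm = np.linalg.norm(g)
    if norm > tau:
        return g * (tau / norm)
    return g

# Illustrative use: clip a heavy-tailed stochastic gradient estimate.
rng = np.random.default_rng(0)
g = rng.standard_t(df=1.5, size=10)   # Student-t noise: finite p-th moment only for p < 1.5
clipped = clip_gradient(g, tau=1.0)
print(np.linalg.norm(clipped) <= 1.0 + 1e-12)  # True: clipped gradient lies in the tau-ball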

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-liu24bo,
  title     = {High-Probability Bound for Non-Smooth Non-Convex Stochastic Optimization with Heavy Tails},
  author    = {Liu, Langqi and Wang, Yibo and Zhang, Lijun},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {32122--32138},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/liu24bo/liu24bo.pdf},
  url       = {https://proceedings.mlr.press/v235/liu24bo.html}
}
APA
Liu, L., Wang, Y., & Zhang, L. (2024). High-Probability Bound for Non-Smooth Non-Convex Stochastic Optimization with Heavy Tails. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:32122-32138. Available from https://proceedings.mlr.press/v235/liu24bo.html.
