Random Scaling and Momentum for Non-smooth Non-convex Optimization

Qinzi Zhang, Ashok Cutkosky
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:58780-58799, 2024.

Abstract

Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.
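As a rough illustration of the modification described in the abstract, the sketch below multiplies each SGDM update by an Exp(1) random scalar. This is only an illustrative reading of the abstract, not the paper's exact algorithm: the names (grad_fn, lr, beta), the momentum form, and the precise placement of the scalar are assumptions made for exposition; the linked PDF gives the actual algorithm and guarantees.

    import numpy as np

    def sgdm_random_scaling(grad_fn, x0, lr=0.01, beta=0.9, steps=1000, seed=0):
        """Sketch: SGD with momentum where each update is scaled by s ~ Exp(1)."""
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        m = np.zeros_like(x)                      # momentum buffer
        for _ in range(steps):
            g = grad_fn(x)                        # stochastic (sub)gradient at x
            m = beta * m + (1.0 - beta) * g       # exponential moving average of gradients
            s = rng.exponential(scale=1.0)        # random scalar s ~ Exp(1), mean 1
            x = x - lr * s * m                    # scale the SGDM update by s
        return x

Note that the random scalar has mean 1, so in expectation each step matches the ordinary SGDM update; the randomness is what the paper exploits to obtain guarantees without convexity or smoothness.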

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-zhang24k,
  title     = {Random Scaling and Momentum for Non-smooth Non-convex Optimization},
  author    = {Zhang, Qinzi and Cutkosky, Ashok},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {58780--58799},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhang24k/zhang24k.pdf},
  url       = {https://proceedings.mlr.press/v235/zhang24k.html},
  abstract  = {Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.}
}
Endnote
%0 Conference Paper
%T Random Scaling and Momentum for Non-smooth Non-convex Optimization
%A Qinzi Zhang
%A Ashok Cutkosky
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zhang24k
%I PMLR
%P 58780--58799
%U https://proceedings.mlr.press/v235/zhang24k.html
%V 235
%X Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.
APA
Zhang, Q. & Cutkosky, A. (2024). Random Scaling and Momentum for Non-smooth Non-convex Optimization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:58780-58799. Available from https://proceedings.mlr.press/v235/zhang24k.html.