Provable Accelerated Convergence of Nesterov’s Momentum for Deep ReLU Neural Networks

Fangshuo Liao, Anastasios Kyrillidis
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:732-784, 2024.

Abstract

Current state-of-the-art analyses of the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov’s momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, in which only a subset of the parameters satisfies strong convexity, and show that Nesterov’s momentum provably achieves acceleration for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, making this the first work to prove an accelerated convergence rate for a non-trivial neural network architecture.
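
For context, a minimal sketch of the standard Nesterov momentum iteration with step size η and momentum parameter β (a generic form; the exact parameterization used in the paper for the partially strongly convex setting may differ):

    w_{t+1} = v_t - \eta \nabla f(v_t)
    v_{t+1} = w_{t+1} + \beta (w_{t+1} - w_t)

As a point of reference, for an L-smooth and μ-strongly convex objective with condition number κ = L/μ, gradient descent needs on the order of κ·log(1/ε) iterations to reach an ε-accurate solution, whereas Nesterov’s momentum needs only on the order of √κ·log(1/ε). This square-root dependence on κ is the “acceleration” that the paper establishes for its objective class, in which only a subset of the parameters satisfies strong convexity.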

Cite this Paper


BibTeX
@InProceedings{pmlr-v237-liao24a, title = {Provable Accelerated Convergence of Nesterov’s Momentum for Deep ReLU Neural Networks}, author = {Liao, Fangshuo and Kyrillidis, Anastasios}, booktitle = {Proceedings of The 35th International Conference on Algorithmic Learning Theory}, pages = {732--784}, year = {2024}, editor = {Vernade, Claire and Hsu, Daniel}, volume = {237}, series = {Proceedings of Machine Learning Research}, month = {25--28 Feb}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v237/liao24a/liao24a.pdf}, url = {https://proceedings.mlr.press/v237/liao24a.html}, abstract = {Current state-of-the-art analyses of the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov’s momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, in which only a subset of the parameters satisfies strong convexity, and show that Nesterov’s momentum provably achieves acceleration for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, making this the first work to prove an accelerated convergence rate for a non-trivial neural network architecture.} }
Endnote
%0 Conference Paper %T Provable Accelerated Convergence of Nesterov’s Momentum for Deep ReLU Neural Networks %A Fangshuo Liao %A Anastasios Kyrillidis %B Proceedings of The 35th International Conference on Algorithmic Learning Theory %C Proceedings of Machine Learning Research %D 2024 %E Claire Vernade %E Daniel Hsu %F pmlr-v237-liao24a %I PMLR %P 732--784 %U https://proceedings.mlr.press/v237/liao24a.html %V 237 %X Current state-of-the-art analyses of the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov’s momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, in which only a subset of the parameters satisfies strong convexity, and show that Nesterov’s momentum provably achieves acceleration for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, making this the first work to prove an accelerated convergence rate for a non-trivial neural network architecture.
APA
Liao, F., & Kyrillidis, A. (2024). Provable Accelerated Convergence of Nesterov’s Momentum for Deep ReLU Neural Networks. Proceedings of The 35th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 237:732-784. Available from https://proceedings.mlr.press/v237/liao24a.html.
