Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Hristo Papazov, Scott Pesme, Nicolas Flammarion
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3556-3564, 2024.

Abstract

In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
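As a rough illustration of the quantities named in the abstract, the sketch below computes $\lambda = \gamma / (1 - \beta)^2$ and runs heavy-ball momentum gradient descent on a small overparametrised regression problem. Several choices here are assumptions made for illustration and are not taken from the paper: the parametrisation $\theta = u \odot v$ for the 2-layer diagonal linear network, the heavy-ball update form, the initialisation scale `alpha`, the synthetic data, and all function names.

```python
import numpy as np


def intrinsic_lambda(gamma: float, beta: float) -> float:
    """Return lambda = gamma / (1 - beta)^2, the quantity named in the abstract."""
    return gamma / (1.0 - beta) ** 2


def loss(X, y, u, v):
    """Mean squared error (halved) of the linear predictor theta = u * v."""
    residual = X @ (u * v) - y
    return 0.5 * np.mean(residual ** 2)


def momentum_gd(X, y, gamma, beta, alpha=0.1, steps=2000):
    """Heavy-ball momentum GD on (u, v); the parametrisation is an assumption."""
    n, d = X.shape
    # Assumed initialisation: u = alpha, v = 0, so that theta starts at zero.
    u, v = np.full(d, alpha), np.zeros(d)
    u_prev, v_prev = u.copy(), v.copy()
    for _ in range(steps):
        g = X.T @ (X @ (u * v) - y) / n   # gradient of the loss w.r.t. theta
        grad_u, grad_v = g * v, g * u     # chain rule through theta = u * v
        u_next = u - gamma * grad_u + beta * (u - u_prev)
        v_next = v - gamma * grad_v + beta * (v - v_prev)
        u_prev, v_prev, u, v = u, v, u_next, v_next
    return u, v


# Hypothetical overparametrised problem: 20 samples, 40 features, sparse target.
rng = np.random.default_rng(0)
n, d = 20, 40
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:3] = 1.0
y = X @ theta_star

gamma, beta = 0.01, 0.5
initial_loss = loss(X, y, np.full(d, 0.1), np.zeros(d))
u, v = momentum_gd(X, y, gamma, beta)
final_loss = loss(X, y, u, v)
print("lambda:", intrinsic_lambda(gamma, beta))
print("loss before/after:", initial_loss, final_loss)
```

Varying `gamma` and `beta` while holding `intrinsic_lambda(gamma, beta)` fixed is one way to probe the abstract's claim numerically; the abstract states that claim for the continuous-time analysis, so discrete runs should only be expected to agree approximately.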

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-papazov24a,
  title = {Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks},
  author = {Papazov, Hristo and Pesme, Scott and Flammarion, Nicolas},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages = {3556--3564},
  year = {2024},
  editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume = {238},
  series = {Proceedings of Machine Learning Research},
  month = {02--04 May},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v238/papazov24a/papazov24a.pdf},
  url = {https://proceedings.mlr.press/v238/papazov24a.html},
  abstract = {In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.}
}
Endnote
%0 Conference Paper
%T Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks
%A Hristo Papazov
%A Scott Pesme
%A Nicolas Flammarion
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-papazov24a
%I PMLR
%P 3556--3564
%U https://proceedings.mlr.press/v238/papazov24a.html
%V 238
%X In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
APA
Papazov, H., Pesme, S. & Flammarion, N. (2024). Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:3556-3564. Available from https://proceedings.mlr.press/v238/papazov24a.html.
