On the Benefits of Large Learning Rates for Kernel Methods

Gaspard Beugnot, Julien Mairal, Alessandro Rudi
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:254-282, 2022.

Abstract

This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, we show that such a phenomenon can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian’s eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why it already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.
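To make the mechanism concrete, below is a minimal Python sketch (not taken from the paper) of plain gradient descent on a two-dimensional quadratic, in the spirit of the Nakkiran (2020) toy problem mentioned above. It relies only on the standard closed form for gradient descent on a quadratic started at the origin: the component of the iterate along each Hessian eigenvector with eigenvalue lambda_i equals (1 - (1 - eta * lambda_i)^t) times the corresponding component of the minimizer, so the pair (eta, t) acts as a spectral filter on the Hessian's spectrum. The helper names gd and spectral_filter are illustrative, not from the paper.

```python
import numpy as np

# Toy illustration: gradient descent on f(w) = 0.5 * w^T H w - b^T w,
# started from w = 0. Along each eigenvector u_i of H (eigenvalue lam_i),
# the iterate after t steps equals (1 - (1 - eta * lam_i)^t) times the
# minimizer's coefficient, i.e. (eta, t) defines a spectral filter.

rng = np.random.default_rng(0)

# Two-dimensional toy problem: one large and one small Hessian eigenvalue.
eigvals = np.array([1.0, 0.01])
U = np.linalg.qr(rng.normal(size=(2, 2)))[0]   # random orthonormal eigenvectors
H = U @ np.diag(eigvals) @ U.T
w_star = rng.normal(size=2)                    # target minimizer
b = H @ w_star

def gd(eta, steps):
    """Plain gradient descent on the quadratic, started at the origin."""
    w = np.zeros(2)
    for _ in range(steps):
        w -= eta * (H @ w - b)
    return w

def spectral_filter(eta, steps):
    """Closed-form filter applied to each Hessian eigenvalue."""
    return 1.0 - (1.0 - eta * eigvals) ** steps

# Same iteration budget, small vs. large (but still stable, eta < 2/lam_max) step size.
for eta, steps in [(0.1, 100), (1.9, 100)]:
    w = gd(eta, steps)
    recovery = (U.T @ w) / (U.T @ w_star)      # fraction of w* recovered per eigendirection
    print(f"eta={eta}: per-direction recovery {recovery.round(3)}, "
          f"closed form {spectral_filter(eta, steps).round(3)}")
```

With the same number of steps, the small step size essentially fits only the large-eigenvalue direction, while the large step size also recovers most of the small-eigenvalue direction. This illustrates how, under early stopping, the choice of learning rate changes the spectral decomposition of the obtained solution on the Hessian's eigenvectors.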

Cite this Paper


BibTeX
@InProceedings{pmlr-v178-beugnot22a,
  title     = {On the Benefits of Large Learning Rates for Kernel Methods},
  author    = {Beugnot, Gaspard and Mairal, Julien and Rudi, Alessandro},
  booktitle = {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages     = {254--282},
  year      = {2022},
  editor    = {Loh, Po-Ling and Raginsky, Maxim},
  volume    = {178},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--05 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v178/beugnot22a/beugnot22a.pdf},
  url       = {https://proceedings.mlr.press/v178/beugnot22a.html}
}
Endnote
%0 Conference Paper
%T On the Benefits of Large Learning Rates for Kernel Methods
%A Gaspard Beugnot
%A Julien Mairal
%A Alessandro Rudi
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-beugnot22a
%I PMLR
%P 254--282
%U https://proceedings.mlr.press/v178/beugnot22a.html
%V 178
APA
Beugnot, G., Mairal, J. & Rudi, A. (2022). On the Benefits of Large Learning Rates for Kernel Methods. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:254-282. Available from https://proceedings.mlr.press/v178/beugnot22a.html.
