Learning-Rate-Free Learning by D-Adaptation

Aaron Defazio, Konstantin Mishchenko
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7449-7479, 2023.

Abstract

The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of D yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An implementation is provided in the supplementary material.
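The core idea the abstract describes, maintaining a running estimate of the distance D from the initial point to the solution set and using it to set the step size, can be illustrated with a short sketch. The code below is a minimal, illustrative NumPy loop, not the authors' reference implementation (that is provided in their supplementary material and in the released d-adaptation package): the function name, the use of d alone as the step size, and the particular form of the lower bound d_hat are assumptions made for this sketch, loosely following the gradient-descent variant described in the paper.

import numpy as np

def d_adapted_sgd_sketch(grad, x0, steps=1000, d0=1e-6):
    """Illustrative D-Adaptation-style loop (sketch, not the paper's algorithm verbatim).

    grad : callable returning a (sub)gradient at a point x
    x0   : initial point (NumPy array)
    d0   : small initial guess for the distance D to the solution set
    """
    x = x0.astype(float).copy()
    d = d0                      # current estimate of D; only ever grows
    s = np.zeros_like(x)        # weighted gradient sum  s = sum_i lambda_i * g_i
    lam_g_sq = 0.0              # running sum of lambda_i^2 * ||g_i||^2
    for _ in range(steps):
        g = grad(x)
        lam = d                 # step size scales with the current D estimate
        x = x - lam * g
        s = s + lam * g
        lam_g_sq += lam ** 2 * float(np.dot(g, g))
        # Lower bound on D implied by the gradients seen so far
        # (assumed form for this sketch):
        d_hat = (float(np.dot(s, s)) - lam_g_sq) / (2.0 * np.linalg.norm(s) + 1e-12)
        d = max(d, d_hat)       # keep the estimate monotonically non-decreasing
    return x, d

On a simple convex test problem, the returned estimate d grows from the tiny initial guess d0 toward the scale of the true distance from x0 to the minimizer, which is what allows the step size to reach a sensible magnitude without any manual learning-rate tuning.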

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-defazio23a,
  title     = {Learning-Rate-Free Learning by {D}-Adaptation},
  author    = {Defazio, Aaron and Mishchenko, Konstantin},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {7449--7479},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/defazio23a/defazio23a.pdf},
  url       = {https://proceedings.mlr.press/v202/defazio23a.html}
}
EndNote
%0 Conference Paper
%T Learning-Rate-Free Learning by D-Adaptation
%A Aaron Defazio
%A Konstantin Mishchenko
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-defazio23a
%I PMLR
%P 7449--7479
%U https://proceedings.mlr.press/v202/defazio23a.html
%V 202
APA
Defazio, A. & Mishchenko, K. (2023). Learning-Rate-Free Learning by D-Adaptation. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:7449-7479. Available from https://proceedings.mlr.press/v202/defazio23a.html.