On the Implicit Bias of Adam

Matias D. Cattaneo, Jason Matthew Klusowski, Boris Shigida
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:5862-5906, 2024.

Abstract

In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
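To make the abstract's contrast concrete, here is a brief LaTeX sketch of the kind of modified objective that backward error analysis produces. The gradient-descent correction follows prior work (e.g., Barrett and Dherin's implicit gradient regularization); the "perturbed one-norm" written for RMSProp/Adam is only an illustrative form of the quantity the abstract refers to, not necessarily the paper's exact definition.

% Gradient descent with step size h: backward error analysis shows the iterates
% track a gradient flow on a modified loss that penalizes the squared two-norm
% of the loss gradient.
\[
  \theta_{k+1} = \theta_k - h\,\nabla L(\theta_k)
  \quad\longrightarrow\quad
  \dot\theta = -\nabla \widetilde L(\theta),
  \qquad
  \widetilde L(\theta) = L(\theta) + \frac{h}{4}\,\bigl\|\nabla L(\theta)\bigr\|_2^2 .
\]
% RMSProp/Adam (illustrative form only): the analogous ODE correction instead
% involves a perturbed one-norm of the loss gradient, e.g.
\[
  \|\nabla L(\theta)\|_{1,\varepsilon}
  \;=\; \sum_{j} \sqrt{\bigl(\partial_j L(\theta)\bigr)^2 + \varepsilon},
\]
% and, per the abstract, the sign of this correction (penalizing the quantity
% or impeding its reduction) depends on the hyperparameters and training stage.

In the limit $\varepsilon \to 0$ this illustrative quantity reduces to the ordinary one-norm $\|\nabla L(\theta)\|_1$, which is why the abstract speaks of a different "norm" than in the gradient-descent case.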

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-cattaneo24a,
  title     = {On the Implicit Bias of {A}dam},
  author    = {Cattaneo, Matias D. and Klusowski, Jason Matthew and Shigida, Boris},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {5862--5906},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/cattaneo24a/cattaneo24a.pdf},
  url       = {https://proceedings.mlr.press/v235/cattaneo24a.html},
  abstract  = {In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.}
}
Endnote
%0 Conference Paper
%T On the Implicit Bias of Adam
%A Matias D. Cattaneo
%A Jason Matthew Klusowski
%A Boris Shigida
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-cattaneo24a
%I PMLR
%P 5862--5906
%U https://proceedings.mlr.press/v235/cattaneo24a.html
%V 235
%X In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
APA
Cattaneo, M.D., Klusowski, J.M. & Shigida, B. (2024). On the Implicit Bias of Adam. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:5862-5906. Available from https://proceedings.mlr.press/v235/cattaneo24a.html.
