Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

Shuo Xie, Zhiyuan Li
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:54488-54510, 2024.

Abstract

Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with ℓ₂ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge here is that though intuitively Adam with ℓ₂ regularization optimizes the ℓ₂ regularized loss, it is not clear if AdamW optimizes a specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the ℓ∞ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is the normalized steepest descent with respect to ℓ∞ norm, and a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
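To make the claimed constraint concrete, below is a minimal numerical sketch (not from the paper) of SignGD with decoupled weight decay, the "hard" version of AdamW that the abstract describes as normalized steepest descent with respect to the ℓ∞ norm. On a toy quadratic loss with illustrative hyperparameters (all assumptions, not the paper's setup), the iterates stay inside the ℓ∞ ball of radius 1/λ, matching the constraint in the theorem statement.

```python
# Minimal sketch (assumptions: toy quadratic loss, illustrative hyperparameters).
# SignGD with decoupled weight decay: x_{t+1} = (1 - lr*lam) * x_t - lr * sign(grad).
# If ||x_t||_inf <= 1/lam and lr*lam <= 1, then ||x_{t+1}||_inf <= 1/lam as well,
# so the l_inf ball of radius 1/lam is invariant. AdamW is a smoothed version of
# this update (sign replaced by moment-normalized gradients).
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = rng.standard_normal((d, d))
b = rng.standard_normal(d)

def grad(x):
    # Gradient of the toy loss 0.5 * ||A x - b||^2.
    return A.T @ (A @ x - b)

lam = 0.1   # decoupled weight decay factor
lr = 0.05   # learning rate; needs lr * lam <= 1 for the invariance argument
x = rng.uniform(-1.0 / lam, 1.0 / lam, size=d)  # start inside the l_inf ball

for _ in range(2000):
    # Normalized steepest descent w.r.t. l_inf norm (sign step) + decoupled weight decay.
    x = x - lr * (np.sign(grad(x)) + lam * x)

print("final l_inf norm:", np.max(np.abs(x)))
print("constraint bound 1/lambda:", 1.0 / lam)
assert np.max(np.abs(x)) <= 1.0 / lam + 1e-12
```

The invariance follows coordinate-wise: |x_{t+1,i}| ≤ (1 − lr·λ)|x_{t,i}| + lr ≤ (1 − lr·λ)/λ + lr = 1/λ. The paper's result is stronger, characterizing the limit point as a KKT point of the constrained problem; this sketch only illustrates the ℓ∞ constraint itself.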

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-xie24e,
  title     = {Implicit Bias of {A}dam{W}: $\ell_\infty$-Norm Constrained Optimization},
  author    = {Xie, Shuo and Li, Zhiyuan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {54488--54510},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/xie24e/xie24e.pdf},
  url       = {https://proceedings.mlr.press/v235/xie24e.html},
  abstract  = {Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge here is that though intuitively Adam with $\ell_2$ regularization optimizes the $\ell_2$ regularized loss, it is not clear if AdamW optimizes a specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is the normalized steepest descent with respect to $\ell_\infty$ norm, and a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.}
}
Endnote
%0 Conference Paper
%T Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization
%A Shuo Xie
%A Zhiyuan Li
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-xie24e
%I PMLR
%P 54488--54510
%U https://proceedings.mlr.press/v235/xie24e.html
%V 235
%X Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge here is that though intuitively Adam with $\ell_2$ regularization optimizes the $\ell_2$ regularized loss, it is not clear if AdamW optimizes a specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is the normalized steepest descent with respect to $\ell_\infty$ norm, and a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
APA
Xie, S. & Li, Z. (2024). Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:54488-54510. Available from https://proceedings.mlr.press/v235/xie24e.html.
