An SDE for Modeling SAM: Theory and Insights

Enea Monzio Compagnoni, Luca Biggio, Antonio Orvieto, Frank Norbert Proske, Hans Kersting, Aurelien Lucchi
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:25209-25253, 2023.

Abstract

We study the SAM (Sharpness-Aware Minimization) optimizer, which has recently attracted considerable interest due to its improved performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, in both the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, with error scaling linearly with the learning rate). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones: it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
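
As a point of reference for the abstract, the sketch below shows the standard discrete-time SAM update that the paper's SDEs model: an ascent step of radius rho toward the locally worst-case perturbation, followed by a descent step using the gradient evaluated at the perturbed point. The quadratic toy loss, step size, and perturbation radius here are illustrative assumptions, not settings taken from the paper.

    # Minimal sketch (illustrative, not from the paper): the standard SAM update
    #   x_{k+1} = x_k - eta * grad f(x_k + rho * grad f(x_k) / ||grad f(x_k)||),
    # applied to a toy quadratic loss with one sharp and one flat direction.
    import numpy as np

    def sam_step(x, grad, eta=0.01, rho=0.05):
        g = grad(x)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case ascent direction
        return x - eta * grad(x + eps)               # descend from the perturbed point

    H = np.diag([10.0, 0.1])   # Hessian: sharp along x[0], flat along x[1]
    grad = lambda x: H @ x     # gradient of f(x) = 0.5 * x^T H x

    x = np.array([1.0, 1.0])
    for _ in range(200):
        x = sam_step(x, grad)
    print(x)  # the sharp coordinate decays much faster than the flat one

The paper's weak-approximation result concerns exactly this kind of iteration: the distribution of its iterates stays close to that of the corresponding SDE, with error of order eta.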

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-monzio-compagnoni23a,
  title     = {An {SDE} for Modeling {SAM}: Theory and Insights},
  author    = {Monzio Compagnoni, Enea and Biggio, Luca and Orvieto, Antonio and Proske, Frank Norbert and Kersting, Hans and Lucchi, Aurelien},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {25209--25253},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/monzio-compagnoni23a/monzio-compagnoni23a.pdf},
  url       = {https://proceedings.mlr.press/v202/monzio-compagnoni23a.html},
  abstract  = {We study the SAM (Sharpness-Aware Minimization) optimizer which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the learning rate). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones – by showing that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.}
}
Endnote
%0 Conference Paper
%T An SDE for Modeling SAM: Theory and Insights
%A Enea Monzio Compagnoni
%A Luca Biggio
%A Antonio Orvieto
%A Frank Norbert Proske
%A Hans Kersting
%A Aurelien Lucchi
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-monzio-compagnoni23a
%I PMLR
%P 25209--25253
%U https://proceedings.mlr.press/v202/monzio-compagnoni23a.html
%V 202
%X We study the SAM (Sharpness-Aware Minimization) optimizer which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the learning rate). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones – by showing that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
APA
Monzio Compagnoni, E., Biggio, L., Orvieto, A., Proske, F.N., Kersting, H. & Lucchi, A. (2023). An SDE for Modeling SAM: Theory and Insights. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:25209-25253. Available from https://proceedings.mlr.press/v202/monzio-compagnoni23a.html.