Avoiding spurious sharpness minimization broadens applicability of SAM

Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, Yann Dauphin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:55702-55719, 2025.

Abstract

Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance — even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics — instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
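For readers unfamiliar with the base procedure the abstract builds on, below is a minimal, framework-agnostic Python sketch of the standard two-step SAM update on a toy softmax-regression problem, with an optional hook showing where a diagonal preconditioner could be applied to the perturbation. All names (sam_step, precond, and so on) are illustrative choices; this shows the generic SAM update being discussed, not the paper's Functional SAM algorithm, whose precise update rule is defined in the paper itself.

# Minimal sketch of the standard SAM update (NOT the paper's Functional SAM).
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(W, X, y):
    """Cross-entropy loss and gradient for a linear softmax classifier."""
    logits = X @ W                          # (n, k) logits
    p = softmax(logits)
    n = X.shape[0]
    loss = -np.log(p[np.arange(n), y]).mean()
    dlogits = p.copy()
    dlogits[np.arange(n), y] -= 1.0
    grad = X.T @ dlogits / n                # dL/dW
    return loss, grad

def sam_step(W, X, y, lr=0.1, rho=0.05, precond=None):
    """One SAM step: perturb the weights along the (optionally preconditioned,
    normalized) gradient, then descend using the gradient at the perturbed point."""
    _, g = loss_and_grad(W, X, y)
    d = g if precond is None else g / (np.sqrt(precond) + 1e-12)  # hypothetical diagonal preconditioner
    perturb = rho * d / (np.linalg.norm(d) + 1e-12)               # ascent perturbation
    _, g_adv = loss_and_grad(W + perturb, X, y)                   # gradient at perturbed weights
    return W - lr * g_adv

# Toy usage: 3-class softmax regression on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 10))
y = rng.integers(0, 3, size=128)
W = np.zeros((10, 3))
for _ in range(50):
    W = sam_step(W, X, y)
print("final loss:", loss_and_grad(W, X, y)[0])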

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-singh25b,
  title     = {Avoiding spurious sharpness minimization broadens applicability of {SAM}},
  author    = {Singh, Sidak Pal and Mobahi, Hossein and Agarwala, Atish and Dauphin, Yann},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {55702--55719},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/singh25b/singh25b.pdf},
  url       = {https://proceedings.mlr.press/v267/singh25b.html},
  abstract  = {Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance — even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics — instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs)}
}
Endnote
%0 Conference Paper
%T Avoiding spurious sharpness minimization broadens applicability of SAM
%A Sidak Pal Singh
%A Hossein Mobahi
%A Atish Agarwala
%A Yann Dauphin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-singh25b
%I PMLR
%P 55702--55719
%U https://proceedings.mlr.press/v267/singh25b.html
%V 267
%X Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance — even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics — instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs)
APA
Singh, S.P., Mobahi, H., Agarwala, A. & Dauphin, Y. (2025). Avoiding spurious sharpness minimization broadens applicability of SAM. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:55702-55719. Available from https://proceedings.mlr.press/v267/singh25b.html.