SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training

Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:41907-41942, 2025.

Abstract

Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training; consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD gradients with normalization and whitening, in a stateless manner, can achieve performance similar to Adam for LLM training while maintaining the same memory footprint as SGD. Specifically, we show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. The result is SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN achieves a 50% reduction in total end-to-end memory compared to Adam. On the memory-efficient LLaMA training benchmark of Zhao et al. (2024a), SWAN reaches the same evaluation perplexity using half as many tokens for the 350M and 1.3B models.
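
The abstract describes SWAN as two stateless pre-processing steps applied to the raw gradient before an SGD update: a normalization that stabilizes the gradient distribution and a whitening that counteracts local curvature. As a rough illustration, the following is a minimal PyTorch sketch of that idea; the row-wise normalization, the eigendecomposition-based inverse square root, and the function names are assumptions made here for illustration, not the paper's exact algorithm (see the linked PDF for the authors' precise recipe).

    # Sketch of a stateless "normalize then whiten" gradient preprocessor,
    # based only on the abstract's description. Axis choices, the whitening
    # solver, and hyperparameters are assumptions, not the paper's recipe.
    import torch

    def normalize_rows(g: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Stabilize the gradient distribution by normalizing each row."""
        return g / (g.std(dim=1, keepdim=True) + eps)

    def whiten(g: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Counteract local curvature by whitening: (G G^T)^{-1/2} G.
        Uses an eigendecomposition for clarity; an iterative scheme
        would avoid forming the explicit inverse square root."""
        gram = g @ g.T
        eigvals, eigvecs = torch.linalg.eigh(gram)
        inv_sqrt = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.T
        return inv_sqrt @ g

    def swan_like_step(param: torch.Tensor, lr: float = 1e-3) -> None:
        """SGD update on normalized-and-whitened gradients.
        Stateless: nothing is carried over between steps."""
        if param.grad is None:
            return
        g = param.grad
        if g.ndim == 2:  # preprocess 2-D (matrix) gradients
            g = whiten(normalize_rows(g))
        param.data.add_(g, alpha=-lr)

Because no running averages or second-moment estimates are stored, the only memory beyond the model and its gradients is transient workspace for the whitening, which is what the abstract's 50% end-to-end memory reduction versus Adam refers to.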

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ma25g,
  title = {{SWAN}: {SGD} with Normalization and Whitening Enables Stateless {LLM} Training},
  author = {Ma, Chao and Gong, Wenbo and Scetbon, Meyer and Meeds, Edward},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {41907--41942},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ma25g/ma25g.pdf},
  url = {https://proceedings.mlr.press/v267/ma25g.html},
  abstract = {Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training; consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD gradients with normalization and whitening, in a stateless manner, can achieve performance similar to Adam for LLM training while maintaining the same memory footprint as SGD. Specifically, we show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. The result is SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN achieves a 50% reduction in total end-to-end memory compared to Adam. On the memory-efficient LLaMA training benchmark of Zhao et al. (2024a), SWAN reaches the same evaluation perplexity using half as many tokens for the 350M and 1.3B models.}
}
Endnote
%0 Conference Paper
%T SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
%A Chao Ma
%A Wenbo Gong
%A Meyer Scetbon
%A Edward Meeds
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ma25g
%I PMLR
%P 41907--41942
%U https://proceedings.mlr.press/v267/ma25g.html
%V 267
%X Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training; consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD gradients with normalization and whitening, in a stateless manner, can achieve performance similar to Adam for LLM training while maintaining the same memory footprint as SGD. Specifically, we show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. The result is SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN achieves a 50% reduction in total end-to-end memory compared to Adam. On the memory-efficient LLaMA training benchmark of Zhao et al. (2024a), SWAN reaches the same evaluation perplexity using half as many tokens for the 350M and 1.3B models.
APA
Ma, C., Gong, W., Scetbon, M. & Meeds, E. (2025). SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:41907-41942. Available from https://proceedings.mlr.press/v267/ma25g.html.
