MARS: Unleashing the Power of Variance Reduction for Training Large Models

Huizhuo Yuan, Yifeng Liu, Shuang Wu, Zhou Xun, Quanquan Gu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:73553-73587, 2025.

Abstract

Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
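
To make the core idea concrete, the scaled stochastic recursive momentum underlying MARS can be sketched roughly as follows. This is a schematic reconstruction in our own notation, not quoted from the paper; the scaling parameter \gamma_t, momentum parameter \beta_1, step size \eta_t, and minibatch \xi_t are illustrative assumptions:

c_t = \nabla f(x_t; \xi_t) + \gamma_t \, \frac{\beta_1}{1 - \beta_1} \bigl( \nabla f(x_t; \xi_t) - \nabla f(x_{t-1}; \xi_t) \bigr),
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, c_t,
x_{t+1} = x_t - \eta_t \cdot \mathrm{precondition}(m_t),

where the preconditioning step can be instantiated with an AdamW-style second-moment rescaling, a Lion-style sign update, or a Shampoo-style matrix preconditioner, corresponding to the three instances named in the abstract. With \gamma_t = 0 the correction term vanishes and the update reduces to ordinary preconditioned momentum; \gamma_t > 0 injects the variance-reduction correction computed on the same minibatch \xi_t.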

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yuan25f,
  title     = {{MARS}: Unleashing the Power of Variance Reduction for Training Large Models},
  author    = {Yuan, Huizhuo and Liu, Yifeng and Wu, Shuang and Xun, Zhou and Gu, Quanquan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {73553--73587},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yuan25f/yuan25f.pdf},
  url       = {https://proceedings.mlr.press/v267/yuan25f.html}
}
Endnote
%0 Conference Paper
%T MARS: Unleashing the Power of Variance Reduction for Training Large Models
%A Huizhuo Yuan
%A Yifeng Liu
%A Shuang Wu
%A Zhou Xun
%A Quanquan Gu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yuan25f
%I PMLR
%P 73553--73587
%U https://proceedings.mlr.press/v267/yuan25f.html
%V 267
APA
Yuan, H., Liu, Y., Wu, S., Xun, Z. & Gu, Q. (2025). MARS: Unleashing the Power of Variance Reduction for Training Large Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:73553-73587. Available from https://proceedings.mlr.press/v267/yuan25f.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v267/main/assets/yuan25f/yuan25f.pdf