FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:80708-80739, 2025.
Abstract
As the number of parameters in large language models grows, training demands ever larger amounts of GPU memory, a significant portion of which is typically consumed by the optimizer state. To address this challenge, recent approaches such as low-rank adaptation (LoRA), low-rank gradient projection (GaLore), and blockwise optimization (BAdam) have been proposed. However, in all of these algorithms the weight updates remain low-rank, which can lead to a substantial loss of information from the gradient; this loss can be especially critical during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL uses gradient splitting to perform low-dimensional updates with advanced algorithms such as Adam, while updates along the remaining directions are carried out by state-free methods such as SGD or signSGD. The framework can be combined with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when SGDM (SGD with momentum) is used for the low-dimensional updates and SGD for the state-free updates. Additionally, our method consistently outperforms concurrent approaches, achieving state-of-the-art results on pre-training and fine-tuning tasks while balancing memory efficiency and performance.
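To make the gradient-splitting idea concrete, the toy NumPy sketch below applies Adam only to the coordinates of the gradient inside a low-dimensional subspace (so the optimizer state is r x m rather than n x m) and a state-free signSGD step to the residual directions, yielding a full-rank update. The function name frugal_step, the fixed orthonormal projection P, and all hyperparameters are illustrative assumptions, not the authors' implementation (which, for instance, may periodically re-select the low-dimensional subspace).

```python
# Hypothetical sketch of gradient splitting for a single weight matrix.
# Assumptions: a fixed column-subspace P and signSGD on the residual;
# this is NOT the reference FRUGAL implementation.
import numpy as np


def frugal_step(W, grad, P, adam_state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One update: Adam inside the subspace, state-free signSGD outside it.

    W          : (n, m) weight matrix
    grad       : (n, m) full gradient
    P          : (n, r) orthonormal basis of the low-dimensional subspace (r << n)
    adam_state : dict with 'm', 'v', 't' kept only for the (r, m) projected coords
    """
    g_low = P.T @ grad              # (r, m) coordinates inside the subspace
    g_residual = grad - P @ g_low   # remaining directions, handled without state

    # Adam on the low-dimensional coordinates: state costs O(r * m), not O(n * m)
    b1, b2 = betas
    adam_state["t"] += 1
    adam_state["m"] = b1 * adam_state["m"] + (1 - b1) * g_low
    adam_state["v"] = b2 * adam_state["v"] + (1 - b2) * g_low ** 2
    m_hat = adam_state["m"] / (1 - b1 ** adam_state["t"])
    v_hat = adam_state["v"] / (1 - b2 ** adam_state["t"])
    update_low = P @ (m_hat / (np.sqrt(v_hat) + eps))

    # signSGD on the residual directions: no optimizer state at all
    update_res = np.sign(g_residual)

    # The combined update is full-rank even though Adam state is low-dimensional
    return W - lr * (update_low + update_res)


if __name__ == "__main__":
    n, m, r = 1024, 512, 64
    rng = np.random.default_rng(0)
    W = 0.02 * rng.standard_normal((n, m))
    P, _ = np.linalg.qr(rng.standard_normal((n, r)))  # fixed subspace for this sketch
    adam_state = {"m": np.zeros((r, m)), "v": np.zeros((r, m)), "t": 0}
    grad = rng.standard_normal((n, m))                # stand-in for a real gradient
    W = frugal_step(W, grad, P, adam_state)
```

The memory saving comes from keeping Adam's first- and second-moment buffers only for the r x m projected coordinates, while the residual directions are still updated every step, just with a state-free rule.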