A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

Yiming Chen; Yuan Zhang; Yin Liu; Kun Yuan; Zaiwen Wen

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:9747-9763, 2025.

Abstract

The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, which achieves performance comparable to GaLore and Adam.
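
The iteration described in the abstract (pick a random subspace, optimize only the parameters inside it, repeat) can be illustrated on a toy problem. The sketch below is not the authors' algorithm or code: it assumes a plain least-squares objective, an orthonormal random projection, and a few gradient steps as the subproblem solver, and every name in it (`subspace_descent`, `rank`, `inner_steps`, and so on) is hypothetical.

```python
import numpy as np


def subspace_descent(X, Y, rank=4, outer_steps=200, inner_steps=5, seed=0):
    """Toy randomized subspace descent for min_W 0.5 * ||X @ W - Y||_F^2.

    Illustrative assumption only: each outer step restricts the update of W
    to W + P @ B, where P is a random (n x rank) matrix with orthonormal
    columns, and runs a few gradient steps on the small variable B.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape[1], Y.shape[1]
    W = np.zeros((n, m))
    losses = [0.5 * np.linalg.norm(X @ W - Y) ** 2]

    for _ in range(outer_steps):
        # Sample a random subspace: orthonormal basis from a Gaussian matrix.
        P, _ = np.linalg.qr(rng.standard_normal((n, rank)))

        # Only the projected inputs X @ P (N x rank) are needed to get the
        # gradient with respect to B, instead of the full X (N x n).
        XP = X @ P
        residual = X @ W - Y

        # Step size 1/L, where L is the smoothness constant of the subproblem.
        step = 1.0 / np.linalg.norm(XP, 2) ** 2

        # Approximately solve the low-dimensional subproblem in B.
        B = np.zeros((rank, m))
        for _ in range(inner_steps):
            grad_B = XP.T @ (residual + XP @ B)
            B -= step * grad_B

        # Map the subspace solution back to the full parameter matrix.
        W += P @ B
        losses.append(0.5 * np.linalg.norm(X @ W - Y) ** 2)

    return W, losses
```

In this toy setting the optimizer state lives in the small matrix `B` (shape `rank x m`) rather than in `W`, and the gradient for `B` only touches the projected inputs `X @ P`. This is meant to loosely mirror the abstract's claim about reducing memory for both optimizer states and activations; the actual method and its guarantees are given in the paper.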

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-chen25cj,
  title     = {A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models},
  author    = {Chen, Yiming and Zhang, Yuan and Liu, Yin and Yuan, Kun and Wen, Zaiwen},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {9747--9763},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25cj/chen25cj.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25cj.html},
  abstract = 	 {The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.}
}

Endnote

%0 Conference Paper
%T A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
%A Yiming Chen
%A Yuan Zhang
%A Yin Liu
%A Kun Yuan
%A Zaiwen Wen
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25cj
%I PMLR
%P 9747--9763
%U https://proceedings.mlr.press/v267/chen25cj.html
%V 267
%X The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.

APA

Chen, Y., Zhang, Y., Liu, Y., Yuan, K. & Wen, Z. (2025). A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:9747-9763. Available from https://proceedings.mlr.press/v267/chen25cj.html.
