Subspace Optimization for Large Language Models with Convergence Guarantees

Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, Kun Yuan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22468-22522, 2025.

Abstract

Subspace optimization algorithms, such as GaLore (Zhao et al., 2024), have gained attention for pre-training and fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we reveal that GaLore does not always converge to the optimal solution and provide an explicit counterexample to support this finding. We further explore the conditions under which GaLore achieves convergence, showing that it does so when either (i) a sufficiently large mini-batch size is used or (ii) the gradient noise is isotropic. More significantly, we introduce GoLore (Gradient random Low-rank projection), a novel variant of GaLore that provably converges in typical stochastic settings, even with standard batch sizes. Our convergence analysis extends naturally to other subspace optimization algorithms. Finally, we empirically validate our theoretical results and thoroughly test the proposed mechanisms. Codes are available at https://github.com/pkumelon/Golore.
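The mechanism at issue is easiest to see in a few lines of code. The sketch below is not the authors' implementation (see the linked repository for that); it simply contrasts a GaLore-style projector, built from the top left singular vectors of the current stochastic gradient, with a GoLore-style projector drawn as a random orthonormal basis independent of that gradient. The matrix sizes, helper names, and NumPy usage are illustrative assumptions.

# Illustrative sketch only, not the code from https://github.com/pkumelon/Golore.
import numpy as np

def svd_projector(grad, rank):
    # GaLore-style: span of the top-`rank` left singular vectors of the gradient.
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]                          # m x r

def random_projector(m, rank, rng):
    # GoLore-style: a random orthonormal basis, independent of the gradient.
    Q, _ = np.linalg.qr(rng.standard_normal((m, rank)))
    return Q                                    # m x r

rng = np.random.default_rng(0)
G = rng.standard_normal((128, 64))              # stand-in for an m x n stochastic gradient
r = 8

for P in (svd_projector(G, r), random_projector(128, r, rng)):
    low_rank_grad = P.T @ G                     # r x n quantity the optimizer stores
    full_rank_update = P @ low_rank_grad        # projected back before the weight update
    print(low_rank_grad.shape, np.linalg.norm(G - full_rank_update))

Intuitively, the difference is that the SVD-based projector depends on the same noisy gradient it then compresses, whereas the random projector does not; this independence is what the abstract's convergence claim for GoLore rests on.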

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-he25i,
  title     = {Subspace Optimization for Large Language Models with Convergence Guarantees},
  author    = {He, Yutong and Li, Pengrui and Hu, Yipeng and Chen, Chuyan and Yuan, Kun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {22468--22522},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/he25i/he25i.pdf},
  url       = {https://proceedings.mlr.press/v267/he25i.html}
}
Endnote
%0 Conference Paper
%T Subspace Optimization for Large Language Models with Convergence Guarantees
%A Yutong He
%A Pengrui Li
%A Yipeng Hu
%A Chuyan Chen
%A Kun Yuan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-he25i
%I PMLR
%P 22468--22522
%U https://proceedings.mlr.press/v267/he25i.html
%V 267
APA
He, Y., Li, P., Hu, Y., Chen, C. & Yuan, K. (2025). Subspace Optimization for Large Language Models with Convergence Guarantees. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22468-22522. Available from https://proceedings.mlr.press/v267/he25i.html.
