From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications

Ajay Kumar Jaiswal, Yifan Wang, Lu Yin, Shiwei Liu, Runjin Chen, Jiawei Zhao, Ananth Grama, Yuandong Tian, Zhangyang Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:26740-26756, 2025.

Abstract

The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank format, with the potential to relax memory and compute requirements. Unlike previous works that pivot around developing novel matrix decomposition algorithms, in this work we study the emerging non-uniform low-rank properties across weight matrices in LLMs through the lens of stabilizing gradient subspaces. First, we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. Second, we empirically establish a consequential relationship between gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating non-uniform rank reduction across them to minimize the performance drop due to compression. In view of this, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. Going beyond a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to be expressed as low-rank. Our gradient-dynamics perspective illustrates that LRCs tend to have better fine-tuning capabilities, and their standalone fine-tuning can closely mimic (and sometimes outperform) the training loss trajectory and performance of full fine-tuning, with a notable reduction in memory and compute footprint. All code and checkpoints will be released.
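The abstract does not spell out the exact criterion WeLore uses to decide how much rank each matrix can shed, so the sketch below is only an illustration of the general idea it describes: decompose a weight matrix with SVD, measure how quickly its singular-value spectrum decays, keep a truncated factorization when the matrix is well expressed as low-rank (an LRC), and leave it dense otherwise (an N-LRC). The function name, the energy threshold, and the rank-ratio cutoff are hypothetical choices for illustration, not the paper's hyperparameters.

import torch

def welore_like_decompose(weight: torch.Tensor,
                          energy_threshold: float = 0.9,
                          max_rank_ratio: float = 0.5):
    """Classify a weight matrix as LRC or N-LRC from its singular-value
    spectrum and, for LRCs, return a rank-reduced factorization.
    Thresholds here are illustrative, not the paper's settings."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)

    # Smallest rank r whose leading singular values capture the desired
    # fraction of the total spectral mass.
    energy = torch.cumsum(S, dim=0) / S.sum()
    r = int(torch.searchsorted(energy, torch.tensor(energy_threshold))) + 1

    full_rank = min(weight.shape)
    if r <= max_rank_ratio * full_rank:
        # LRC: spectrum decays fast enough, so keep W ~= A @ B
        # with A = U_r * S_r and B = Vh_r.
        A = U[:, :r] * S[:r]   # (out_features, r)
        B = Vh[:r, :]          # (r, in_features)
        return "LRC", (A, B)

    # N-LRC: spectrum decays slowly, so keep the dense matrix unchanged.
    return "N-LRC", weight

# Example: a synthetic matrix of rank <= 64 should be flagged as an LRC.
W = torch.randn(512, 64) @ torch.randn(64, 512)
kind, factors = welore_like_decompose(W)
print(kind)

Under a scheme of this kind, memory-efficient fine-tuning could then update only the low-rank factors of the LRCs while freezing the dense N-LRCs, which is the behavior the abstract attributes to WeLore.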

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-jaiswal25a,
  title     = {From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications},
  author    = {Jaiswal, Ajay Kumar and Wang, Yifan and Yin, Lu and Liu, Shiwei and Chen, Runjin and Zhao, Jiawei and Grama, Ananth and Tian, Yuandong and Wang, Zhangyang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {26740--26756},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/jaiswal25a/jaiswal25a.pdf},
  url       = {https://proceedings.mlr.press/v267/jaiswal25a.html}
}
Endnote
%0 Conference Paper
%T From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications
%A Ajay Kumar Jaiswal
%A Yifan Wang
%A Lu Yin
%A Shiwei Liu
%A Runjin Chen
%A Jiawei Zhao
%A Ananth Grama
%A Yuandong Tian
%A Zhangyang Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-jaiswal25a
%I PMLR
%P 26740--26756
%U https://proceedings.mlr.press/v267/jaiswal25a.html
%V 267
APA
Jaiswal, A.K., Wang, Y., Yin, L., Liu, S., Chen, R., Zhao, J., Grama, A., Tian, Y. & Wang, Z. (2025). From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:26740-26756. Available from https://proceedings.mlr.press/v267/jaiswal25a.html.