Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision

Li Shen; Anke Tang; Yong Luo; Tao Sun; Han Hu; Xiaochun Cao

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54457-54475, 2025.

Abstract

Pruning is a widely used technique for compressing large neural networks that eliminates weights that have minimal impact on the model’s performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. Nonetheless, these methods often create a gap between the original dense and the pruned sparse model, potentially impairing performance. Especially when the sparsity ratio is high, the gap becomes more pronounced. To mitigate this issue, we introduce a method to bridge the gap left by pruning by utilizing a low-rank approximation of the difference between the dense and sparse matrices. Our method entails the iterative refinement of the sparse weight matrix augmented by a low-rank adjustment. This technique captures and retains the essential information often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate its effectiveness on large language models across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50% sparsity, it reduces perplexity by 53.9% compared to conventional magnitude pruning on LLaMa-7B. Furthermore, to achieve a specific performance target, our approach enables an 8.6% reduction in model parameters while maintaining a sparsity ratio of about 50%.
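The idea in the abstract, a sparse matrix plus a low-rank correction that approximates the difference between the dense and sparse weights, can be illustrated with a small sketch. This is not the paper's algorithm: the fixed magnitude mask, the alternating update order, the power-iteration routine used in place of a proper SVD, and all function names (`magnitude_mask`, `low_rank`, `refine`) are assumptions made for illustration only.

```python
import random


def magnitude_mask(W, sparsity):
    """Return a boolean mask that keeps the largest-magnitude entries of W."""
    magnitudes = sorted(abs(x) for row in W for x in row)
    n_pruned = int(len(magnitudes) * sparsity)
    # Entries strictly below this threshold are pruned (ties are kept).
    threshold = magnitudes[n_pruned] if n_pruned < len(magnitudes) else float("inf")
    return [[abs(x) >= threshold for x in row] for row in W]


def frobenius(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in A for x in row) ** 0.5


def subtract(A, B):
    """Elementwise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def low_rank(R, rank, steps=50):
    """Approximate R by a sum of `rank` rank-1 terms (power iteration + deflation).

    Assumption: a stand-in for a truncated SVD, chosen to avoid dependencies.
    """
    m, n = len(R), len(R[0])
    L = [[0.0] * n for _ in range(m)]
    for _component in range(rank):
        v = [1.0] * n
        for _step in range(steps):
            u = [sum(R[i][j] * v[j] for j in range(n)) for i in range(m)]
            norm = frobenius([u]) or 1.0
            u = [x / norm for x in u]
            v = [sum(R[i][j] * u[i] for i in range(m)) for j in range(n)]
        term = [[u[i] * v[j] for j in range(n)] for i in range(m)]
        L = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(L, term)]
        R = subtract(R, term)  # deflate: remove the component just captured
    return L


def refine(W, mask, rank, iters=10):
    """Alternate between a masked sparse update and a low-rank update."""
    m, n = len(W), len(W[0])
    L = [[0.0] * n for _ in range(m)]
    for _ in range(iters):
        # Sparse step: fit W - L on the kept positions only.
        D = subtract(W, L)
        S = [[d if keep else 0.0 for d, keep in zip(rd, rk)] for rd, rk in zip(D, mask)]
        # Low-rank step: approximate what the sparse matrix still misses.
        L = low_rank(subtract(W, S), rank)
    return S, L


# Toy data (hypothetical): a random 8x8 "weight matrix" pruned to 50% sparsity.
random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
mask = magnitude_mask(W, sparsity=0.5)
pruned = [[w if keep else 0.0 for w, keep in zip(rw, rk)] for rw, rk in zip(W, mask)]
S, L = refine(W, mask, rank=2)
print("pruning only:      ", frobenius(subtract(W, pruned)))
print("sparse + low-rank: ", frobenius(subtract(subtract(W, S), L)))
```

The two printed values are the Frobenius-norm reconstruction errors of plain magnitude pruning and of the sparse-plus-low-rank pair. In this sketch the low-rank term adds rank × (rows + columns) stored values on top of the kept sparse entries. A practical implementation would more likely use a truncated SVD from a linear algebra library than the power iteration shown here.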

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-shen25e,
  title = 	 {Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision},
  author =       {Shen, Li and Tang, Anke and Luo, Yong and Sun, Tao and Hu, Han and Cao, Xiaochun},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {54457--54475},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shen25e/shen25e.pdf},
  url = 	 {https://proceedings.mlr.press/v267/shen25e.html},
  abstract = 	 {Pruning is a widely used technique for compressing large neural networks that eliminates weights that have minimal impact on the model’s performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. Nonetheless, these methods often create a gap between the original dense and the pruned sparse model, potentially impairing performance. Especially when the sparsity ratio is high, the gap becomes more pronounced. To mitigate this issue, we introduce a method to bridge the gap left by pruning by utilizing a low-rank approximation of the difference between the dense and sparse matrices. Our method entails the iterative refinement of the sparse weight matrix augmented by a low-rank adjustment. This technique captures and retains the essential information often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate its effectiveness on large language models across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50% sparsity, it reduces perplexity by 53.9% compared to conventional magnitude pruning on LLaMa-7B. Furthermore, to achieve a specific performance target, our approach enables an 8.6% reduction in model parameters while maintaining a sparsity ratio of about 50%.}
}

Endnote

%0 Conference Paper
%T Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision
%A Li Shen
%A Anke Tang
%A Yong Luo
%A Tao Sun
%A Han Hu
%A Xiaochun Cao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shen25e
%I PMLR
%P 54457--54475
%U https://proceedings.mlr.press/v267/shen25e.html
%V 267
%X Pruning is a widely used technique for compressing large neural networks that eliminates weights that have minimal impact on the model’s performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. Nonetheless, these methods often create a gap between the original dense and the pruned sparse model, potentially impairing performance. Especially when the sparsity ratio is high, the gap becomes more pronounced. To mitigate this issue, we introduce a method to bridge the gap left by pruning by utilizing a low-rank approximation of the difference between the dense and sparse matrices. Our method entails the iterative refinement of the sparse weight matrix augmented by a low-rank adjustment. This technique captures and retains the essential information often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate its effectiveness on large language models across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50% sparsity, it reduces perplexity by 53.9% compared to conventional magnitude pruning on LLaMa-7B. Furthermore, to achieve a specific performance target, our approach enables an 8.6% reduction in model parameters while maintaining a sparsity ratio of about 50%.

APA

Shen, L., Tang, A., Luo, Y., Sun, T., Hu, H. & Cao, X. (2025). Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54457-54475. Available from https://proceedings.mlr.press/v267/shen25e.html.
