A unified framework for Sparse plus Low-Rank Matrix Decomposition for LLMs
Conference on Parsimony and Learning, PMLR 280:484-499, 2025.
Abstract
The impressive capabilities of large foundation models come at the cost of the substantial computing resources required to serve them. Compressing these pre-trained models is of practical interest, as it can democratize their deployment across the machine learning community at large by lowering inference costs. A promising compression scheme is to decompose a foundation model's dense weights into a sum of sparse plus low-rank matrices. In this paper, we design a unified framework, coined $\texttt{HASSLE-free}$, for (semi-structured) sparse plus low-rank matrix decomposition of foundation models. Our framework introduces the local layer-wise reconstruction error objective for this decomposition; we demonstrate that prior work solves a relaxation of this optimization problem; and we provide efficient and scalable methods to minimize the $\textit{exact}$ objective. $\texttt{HASSLE-free}$ substantially outperforms state-of-the-art methods in terms of the introduced objective and a wide range of LLM evaluation benchmarks. For the Llama3-8B model with a 2:4 sparsity component plus a rank-64 component decomposition, a compression scheme for which recent work demonstrates significant inference acceleration on GPUs, $\texttt{HASSLE-free}$ reduces the WikiText-2 test perplexity by $18$% and reduces the gap (relative to the dense model) in the average score over eight popular zero-shot tasks by $28$% compared to existing methods.
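
To make the abstract's central quantity concrete, the sketch below illustrates a layer-wise reconstruction error for a sparse plus low-rank decomposition $W \approx S + LR$, with $S$ following a 2:4 sparsity pattern and $LR$ having rank $r$. This is a minimal, hypothetical illustration, not the paper's HASSLE-free algorithm; the masking heuristic and all function names are assumptions made for exposition.

```python
import torch

def two_to_four_mask(W: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every group of 4 along each row
    (a simple 2:4 semi-structured sparsity mask; illustrative only)."""
    rows, cols = W.shape
    groups = W.abs().reshape(rows, cols // 4, 4)
    idx = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, idx, 1.0)
    return mask.reshape(rows, cols)

def layerwise_reconstruction_error(X, W, S, L, R):
    """|| X W^T - X (S + L R)^T ||_F^2 : the error is measured on the layer's
    calibration inputs X, not on the weight matrix alone."""
    return torch.linalg.norm(X @ W.T - X @ (S + L @ R).T) ** 2

# Toy usage: compare a random dense weight against one candidate decomposition.
d_out, d_in, n, r = 32, 64, 128, 8
W = torch.randn(d_out, d_in)
X = torch.randn(n, d_in)                    # calibration activations for this layer
L, R = torch.randn(d_out, r), torch.randn(r, d_in)
residual = W - L @ R
S = residual * two_to_four_mask(residual)   # 2:4-sparse approximation of the residual
print(layerwise_reconstruction_error(X, W, S, L, R).item())
```

In practice, the sparse mask and the low-rank factors would be optimized jointly against this objective rather than fixed by a one-shot heuristic as above; the point of the sketch is only to show what the layer-wise objective measures.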