Olica: Efficient Structured Pruning of Large Language Models without Retraining

Jiujun He, Huazhen Lin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22580-22594, 2025.

Abstract

Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose an efficient pruning framework for LLMs called Orthogonal Neuron Decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products (i.e., ${\rm W}_q{\rm W}^{\top}_k$ and ${\rm W}_v{\rm W}^{\top}_o$). By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. Moreover, a fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of a pruned layer using two low-rank matrices. By applying singular value decomposition (SVD) to the solution of a least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
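To make the two operations described above concrete, here is a minimal NumPy sketch of the kind of factorization the abstract points to: compressing the unified query-key product via a truncated SVD, used here as a stand-in for the paper's PCA step. Function names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def compress_qk_product(W_q, W_k, r):
    """Factor the unified product W_q @ W_k.T into two rank-r matrices.

    W_q, W_k : (d, d_h) query/key projection weights of one attention head.
    r        : retained rank (r < d_h).
    Returns (d, r) factors whose product approximates W_q @ W_k.T, so the
    attention scores x @ W_q @ W_k.T @ y.T are approximately preserved.
    """
    P = W_q @ W_k.T                              # treat the product as one entity, shape (d, d)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    W_q_new = U[:, :r] * np.sqrt(s[:r])          # compressed query projection, (d, r)
    W_k_new = Vt[:r, :].T * np.sqrt(s[:r])       # compressed key projection, (d, r)
    return W_q_new, W_k_new
```

The linear calibration of a pruned FFN layer can be sketched in the same spirit: fit a linear map from the layer's inputs to its residual error by least squares, then truncate that map to rank r with an SVD so the correction is stored as two small matrices. Again, this is only a plausible reading of the abstract with hypothetical names, not the paper's code.

```python
import numpy as np

def linear_calibration(X, R, r):
    """Fit a rank-r linear correction X @ A @ B that approximates the residual R.

    X : (n, d) calibration activations entering the pruned layer.
    R : (n, d) residual error, original_layer(X) - pruned_layer(X).
    r : rank of the correction.
    """
    W, *_ = np.linalg.lstsq(X, R, rcond=None)    # least-squares solution, shape (d, d)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]                         # (d, r)
    B = Vt[:r, :]                                # (r, d)
    return A, B                                  # calibrated output: pruned_layer(X) + X @ A @ B
```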

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-he25m,
  title     = {Olica: Efficient Structured Pruning of Large Language Models without Retraining},
  author    = {He, Jiujun and Lin, Huazhen},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {22580--22594},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/he25m/he25m.pdf},
  url       = {https://proceedings.mlr.press/v267/he25m.html}
}
Endnote
%0 Conference Paper
%T Olica: Efficient Structured Pruning of Large Language Models without Retraining
%A Jiujun He
%A Huazhen Lin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-he25m
%I PMLR
%P 22580--22594
%U https://proceedings.mlr.press/v267/he25m.html
%V 267
APA
He, J. & Lin, H. (2025). Olica: Efficient Structured Pruning of Large Language Models without Retraining. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22580-22594. Available from https://proceedings.mlr.press/v267/he25m.html.
