Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining

Jianwei Li, Yijun Dong, Qi Lei
Conference on Parsimony and Learning, PMLR 280:500-520, 2025.

Abstract

To remove redundant components of large language models (LLMs) without incurring significant pruning costs, this work focuses on single-shot structured pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperform traditional training-aware metrics such as the gradient and the Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our strategy significantly reduces pruning costs and hardware requirements while maintaining superior performance across various datasets and models.
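To make the abstract's idea concrete, below is a minimal sketch (Python with NumPy) of an output-approximation pruning criterion and a least-squares reconstruction for a depth-2 feed-forward block y = W_down · act(W_up · x). This is not the authors' implementation: the function names (channel_scores, prune_and_reconstruct), the norm-based scoring rule, and the single least-squares refit on a calibration batch are illustrative assumptions standing in for the paper's inference-aware criteria and two-step reconstruction.

    import numpy as np

    # NOTE: illustrative sketch only, not the authors' code. Assumes a generic
    # depth-2 FFN block y = W_down @ relu(W_up @ x) and a calibration batch X.

    def channel_scores(W_up, W_down, X):
        """Score each hidden channel by the output mass it contributes on X.

        W_up: (h, d) up projection, W_down: (d, h) down projection, X: (d, n) inputs.
        """
        H = np.maximum(W_up @ X, 0.0)                    # hidden activations, (h, n)
        # ||W_down[:, j]|| * ||H[j, :]|| estimates how much removing channel j
        # perturbs the block output (one plausible inference-aware criterion).
        return np.linalg.norm(W_down, axis=0) * np.linalg.norm(H, axis=1)

    def prune_and_reconstruct(W_up, W_down, X, keep_ratio=0.5):
        """Keep the highest-scoring channels, then refit W_down by least squares."""
        scores = channel_scores(W_up, W_down, X)
        k = max(1, int(keep_ratio * scores.size))
        keep = np.argsort(scores)[-k:]                   # channels to keep
        H = np.maximum(W_up @ X, 0.0)
        Y = W_down @ H                                   # original block output to match
        # min_W' || W' @ H[keep] - Y ||_F, solved as a standard least-squares problem.
        W_prime_T, *_ = np.linalg.lstsq(H[keep].T, Y.T, rcond=None)
        return W_up[keep], W_prime_T.T, keep

Given calibration inputs X of shape (d, n), prune_and_reconstruct returns the pruned up-projection, the refit down-projection, and the indices of the kept channels; the refit step is what lets the pruned block approximate the dense block's outputs without any retraining.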

Cite this Paper


BibTeX
@InProceedings{pmlr-v280-li25b,
  title     = {Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining},
  author    = {Li, Jianwei and Dong, Yijun and Lei, Qi},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {500--520},
  year      = {2025},
  editor    = {Chen, Beidi and Liu, Shijia and Pilanci, Mert and Su, Weijie and Sulam, Jeremias and Wang, Yuxiang and Zhu, Zhihui},
  volume    = {280},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v280/main/assets/li25b/li25b.pdf},
  url       = {https://proceedings.mlr.press/v280/li25b.html},
  abstract  = {To remove redundant components of large language models (LLMs) without incurring significant pruning costs, this work focuses on single-shot structured pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperforms traditional training-aware metrics such as gradient and Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our strategy significantly reduces pruning costs and hardware requirements while maintaining superior performance across various datasets and models.}
}
Endnote
%0 Conference Paper
%T Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining
%A Jianwei Li
%A Yijun Dong
%A Qi Lei
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Beidi Chen
%E Shijia Liu
%E Mert Pilanci
%E Weijie Su
%E Jeremias Sulam
%E Yuxiang Wang
%E Zhihui Zhu
%F pmlr-v280-li25b
%I PMLR
%P 500--520
%U https://proceedings.mlr.press/v280/li25b.html
%V 280
%X To remove redundant components of large language models (LLMs) without incurring significant pruning costs, this work focuses on single-shot structured pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperforms traditional training-aware metrics such as gradient and Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our strategy significantly reduces pruning costs and hardware requirements while maintaining superior performance across various datasets and models.
APA
Li, J., Dong, Y. & Lei, Q. (2025). Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 280:500-520. Available from https://proceedings.mlr.press/v280/li25b.html.
