Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective

Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:26175-26201, 2025.

Abstract

In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of "reconstruction error explosion" in existing LLM sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common-difference hyperparameter. Remarkably, this allows the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively. Code is available at https://github.com/wzhuang-xmu/ATP.
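
The allocation scheme described above can be made concrete with a short sketch. The snippet below assigns each layer a sparsity drawn from a monotonically increasing arithmetic progression whose mean equals the target average sparsity, so only the common difference remains to be tuned. The function name, the centering convention, and the clipping are illustrative assumptions and not the authors' implementation; the official code is in the repository linked above.

# A minimal sketch of arithmetic-progression sparsity allocation (assumptions noted above).
def layerwise_sparsity(num_layers: int, target_sparsity: float, common_diff: float) -> list[float]:
    """Per-layer sparsity from an increasing arithmetic progression whose
    mean equals target_sparsity; later layers receive higher sparsity."""
    center = (num_layers - 1) / 2.0
    rates = [target_sparsity + (i - center) * common_diff for i in range(num_layers)]
    # Clip to valid rates; a small common_diff keeps clipping (and mean drift) negligible.
    return [min(max(r, 0.0), 1.0) for r in rates]

if __name__ == "__main__":
    # Example: a 32-layer model at 70% average sparsity. Only the common
    # difference needs to be searched, e.g. over a handful of small values.
    for d in (0.0, 0.005, 0.01):
        rates = layerwise_sparsity(32, 0.70, d)
        print(f"d={d}: first={rates[0]:.3f}, last={rates[-1]:.3f}, "
              f"mean={sum(rates) / len(rates):.3f}")

With d = 0 the scheme reduces to uniform sparsity; increasing d keeps earlier layers denser and prunes later layers more aggressively, which is the behavior the paper's analysis of error propagation motivates.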

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-huang25ax,
  title     = {Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective},
  author    = {Huang, Weizhong and Zhang, Yuxin and Zheng, Xiawu and Chao, Fei and Ji, Rongrong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {26175--26201},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/huang25ax/huang25ax.pdf},
  url       = {https://proceedings.mlr.press/v267/huang25ax.html}
}
APA
Huang, W., Zhang, Y., Zheng, X., Chao, F. & Ji, R. (2025). Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:26175-26201. Available from https://proceedings.mlr.press/v267/huang25ax.html.
