How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark

Eldar Kurtic; Torsten Hoefler; Dan Alistarh

Conference on Parsimony and Learning, PMLR 234:542-553, 2024.

Abstract

Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent “Sparsity May Cry” (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefits analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results, both on classic BERT-pruning benchmarks, as well as on the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results, with the right approach.

Cite this Paper

BibTeX

@InProceedings{pmlr-v234-kurtic24a,
  title = 	 {How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark},
  author =       {Kurtic, Eldar and Hoefler, Torsten and Alistarh, Dan},
  booktitle = 	 {Conference on Parsimony and Learning},
  pages = 	 {542--553},
  year = 	 {2024},
  editor = 	 {Chi, Yuejie and Dziugaite, Gintare Karolina and Qu, Qing and Wang, Atlas and Zhu, Zhihui},
  volume = 	 {234},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {03--06 Jan},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v234/kurtic24a/kurtic24a.pdf},
  url = 	 {https://proceedings.mlr.press/v234/kurtic24a.html},
  abstract = 	 {Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent “Sparsity May Cry” (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefits analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results, both on classic BERT-pruning benchmarks, as well as on the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results, with the right approach.}
}

Endnote

%0 Conference Paper
%T How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark
%A Eldar Kurtic
%A Torsten Hoefler
%A Dan Alistarh
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Yuejie Chi
%E Gintare Karolina Dziugaite
%E Qing Qu
%E Atlas Wang
%E Zhihui Zhu
%F pmlr-v234-kurtic24a
%I PMLR
%P 542--553
%U https://proceedings.mlr.press/v234/kurtic24a.html
%V 234
%X Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent “Sparsity May Cry” (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefits analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results, both on classic BERT-pruning benchmarks, as well as on the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results, with the right approach.

APA

Kurtic, E., Hoefler, T. & Alistarh, D. (2024). How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 234:542-553. Available from https://proceedings.mlr.press/v234/kurtic24a.html.
