How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark

Eldar Kurtic, Torsten Hoefler, Dan Alistarh
Conference on Parsimony and Learning, PMLR 234:542-553, 2024.

Abstract

Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent “Sparsity May Cry” (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefits analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results, both on classic BERT-pruning benchmarks, as well as on the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results, with the right approach.
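To make these ingredients concrete, below is a minimal, hypothetical PyTorch sketch of two of them: a cubic sparsity schedule driving gradual magnitude pruning (GMP), and a temperature-scaled knowledge-distillation loss. This is not the authors' implementation; the function names (cubic_sparsity, magnitude_prune_, distillation_loss) and the hyperparameters (temperature 2.0, alpha 0.5) are illustrative assumptions only.

import torch
import torch.nn.functional as F


def cubic_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    # Cubic schedule (Zhu & Gupta, 2017): ramp sparsity from initial_sparsity
    # at start_step to final_sparsity at end_step, pruning fastest early on.
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_prune_(weight, sparsity):
    # Zero out the smallest-magnitude fraction of the entries of `weight`, in place.
    k = int(sparsity * weight.numel())
    if k == 0:
        return
    threshold = weight.abs().flatten().kthvalue(k).values
    weight.mul_((weight.abs() > threshold).float())


def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # KL divergence between temperature-softened student and teacher distributions,
    # mixed with the plain cross-entropy on the hard labels.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    # Prune a random weight matrix to the sparsity the schedule prescribes
    # halfway through a 1000-step pruning phase targeting 90% sparsity.
    w = torch.randn(768, 768)
    s = cubic_sparsity(step=500, start_step=0, end_step=1000, final_sparsity=0.90)
    magnitude_prune_(w, s)
    print(f"target sparsity {s:.3f}, achieved {(w == 0).float().mean().item():.3f}")

    # Toy knowledge-distillation loss on random logits for a 2-class task.
    student = torch.randn(8, 2)
    teacher = torch.randn(8, 2)
    labels = torch.randint(0, 2, (8,))
    print(f"distillation loss: {distillation_loss(student, teacher, labels).item():.4f}")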

Cite this Paper

BibTeX
@InProceedings{pmlr-v234-kurtic24a,
  title     = {How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark},
  author    = {Kurtic, Eldar and Hoefler, Torsten and Alistarh, Dan},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {542--553},
  year      = {2024},
  editor    = {Chi, Yuejie and Dziugaite, Gintare Karolina and Qu, Qing and Wang, Atlas and Zhu, Zhihui},
  volume    = {234},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--06 Jan},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v234/kurtic24a/kurtic24a.pdf},
  url       = {https://proceedings.mlr.press/v234/kurtic24a.html},
  abstract  = {Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent “Sparsity May Cry” (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefits analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results, both on classic BERT-pruning benchmarks, as well as on the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results, with the right approach.}
}
Endnote
%0 Conference Paper
%T How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark
%A Eldar Kurtic
%A Torsten Hoefler
%A Dan Alistarh
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Yuejie Chi
%E Gintare Karolina Dziugaite
%E Qing Qu
%E Atlas Wang
%E Zhihui Zhu
%F pmlr-v234-kurtic24a
%I PMLR
%P 542--553
%U https://proceedings.mlr.press/v234/kurtic24a.html
%V 234
%X Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent “Sparsity May Cry” (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets, and propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark. First, we perform a cost-vs-benefits analysis of pruning model components, such as the embeddings and the classification head; second, we provide a simple-yet-general way of scaling training, sparsification and learning rate schedules relative to the desired target sparsity; finally, we investigate the importance of proper parametrization for Knowledge Distillation in the context of LLMs. Our simple insights lead to state-of-the-art results, both on classic BERT-pruning benchmarks, as well as on the SMC benchmark, showing that even classic gradual magnitude pruning (GMP) can yield competitive results, with the right approach.
APA
Kurtic, E., Hoefler, T. & Alistarh, D. (2024). How to Prune Your Language Model: Recovering Accuracy on the “Sparsity May Cry” Benchmark. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 234:542-553. Available from https://proceedings.mlr.press/v234/kurtic24a.html.