APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:60812-60831, 2024.

Abstract

Fine-tuning and inference with large language models (LMs) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT, which adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% of task performance when pruning RoBERTa and T5 models to 40% of their parameters, and keeps 86.4% of LLaMA models’ performance with 70% of parameters remaining. Furthermore, APT speeds up LMs’ fine-tuning by up to 8× and reduces large LMs’ training memory footprint by up to 70%. Our code and models are publicly available at https://github.com/ROIM1998/APT.
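The abstract compresses two mechanisms into one sentence: scoring structured parameter blocks so the unimportant ones can be discarded, and adding salient tuning parameters early in fine-tuning. The toy Python sketch below illustrates that interplay; it is not the paper's implementation (see the linked repository for that), and the AdaptiveLoRALinear module, the first-order |parameter × gradient| salience proxy, and the rank-growing step are all illustrative assumptions.

# Toy sketch (not the authors' implementation) of the two ideas in the abstract:
# (1) score structured blocks by a salience estimate and mark the least important
# ones for pruning, and (2) grow the low-rank tuning parameters of the blocks kept.
import torch
import torch.nn as nn


class AdaptiveLoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank adapter whose rank can grow."""

    def __init__(self, d_in, d_out, rank=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # stands in for the frozen pretrained weight
        # Small random init for both factors so the salience proxy below is nonzero
        # in this toy example (standard LoRA would initialize lora_b to zero).
        self.lora_a = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_b = nn.Parameter(torch.randn(d_out, rank) * 0.01)

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

    def grow_rank(self, extra):
        """Add `extra` rank dimensions, i.e. new tuning parameters for a salient block."""
        d_out, d_in = self.base.weight.shape
        self.lora_a = nn.Parameter(torch.cat([self.lora_a.data,
                                              torch.randn(extra, d_in) * 0.01]))
        self.lora_b = nn.Parameter(torch.cat([self.lora_b.data,
                                              torch.zeros(d_out, extra)], dim=1))


def block_salience(layer):
    """First-order salience proxy: sum of |parameter * gradient| over the adapter."""
    total = 0.0
    for p in (layer.lora_a, layer.lora_b):
        if p.grad is not None:
            total += (p * p.grad).abs().sum().item()
    return total


# One "adaptive" step: run a backward pass, score each block, mark the least
# salient block for pruning, and grow the adapter rank of the blocks that are kept.
layers = nn.ModuleList([AdaptiveLoRALinear(16, 16) for _ in range(4)])
x, target = torch.randn(8, 16), torch.randn(8, 16)
out = x
for layer in layers:
    out = torch.relu(layer(out))
loss = nn.functional.mse_loss(out, target)
loss.backward()

scores = [block_salience(l) for l in layers]
order = sorted(range(len(layers)), key=lambda i: scores[i])
to_prune = set(order[:1])  # a real system would remove the corresponding heads/FFN columns
for i, layer in enumerate(layers):
    if i not in to_prune:
        layer.grow_rank(4)  # add tuning capacity where it is most useful
print("blocks marked for pruning:", to_prune)
print("salience scores:", [round(s, 6) for s in scores])

In a full training loop the newly created adapter parameters would also need to be registered with the optimizer after each growth step; the sketch omits that bookkeeping for brevity.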

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-zhao24g,
  title     = {{APT}: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference},
  author    = {Zhao, Bowen and Hajishirzi, Hannaneh and Cao, Qingqing},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {60812--60831},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhao24g/zhao24g.pdf},
  url       = {https://proceedings.mlr.press/v235/zhao24g.html},
  abstract  = {Fine-tuning and inference with large Language Models (LM) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT that adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% task performance when pruning RoBERTa and T5 models with 40% parameters left while keeping 86.4% LLaMA models’ performance with 70% parameters remaining. Furthermore, APT speeds up LMs’ fine-tuning by up to 8$\times$ and reduces large LMs’ memory training footprint by up to 70%. Our code and models are publicly available at https://github.com/ROIM1998/APT.}
}
Endnote
%0 Conference Paper
%T APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference
%A Bowen Zhao
%A Hannaneh Hajishirzi
%A Qingqing Cao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zhao24g
%I PMLR
%P 60812--60831
%U https://proceedings.mlr.press/v235/zhao24g.html
%V 235
%X Fine-tuning and inference with large Language Models (LM) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT that adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% task performance when pruning RoBERTa and T5 models with 40% parameters left while keeping 86.4% LLaMA models’ performance with 70% parameters remaining. Furthermore, APT speeds up LMs’ fine-tuning by up to 8$\times$ and reduces large LMs’ memory training footprint by up to 70%. Our code and models are publicly available at https://github.com/ROIM1998/APT.
APA
Zhao, B., Hajishirzi, H. & Cao, Q. (2024). APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:60812-60831. Available from https://proceedings.mlr.press/v235/zhao24g.html.
