LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models

Guangyan Li, Yongqiang Tang, Wensheng Zhang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:28657-28672, 2024.

Abstract

Large language models (LLMs) show excellent performance on difficult tasks, but they often require massive memory and computational resources. How to reduce the parameter scale of LLMs has become a research hotspot. In this study, we make an important observation: the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. In light of this, we design a novel structured compression method, LoRAP, which organically combines Low-Rank matrix approximation And structured Pruning. For the MHA sub-layer, we propose an input-activation-weighted singular value decomposition method and allocate a different amount of parameters to each weight matrix according to the differences in the matrices' low-rank properties. For the FFN sub-layer, we propose a gradient-free structured channel pruning method and retain the least important 1% of parameters, which in fact play a vital role in model performance. Extensive evaluations of zero-shot perplexity and zero-shot task classification indicate that our proposal is superior to previous structured compression rivals under multiple compression ratios. Our code will be released soon.
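
As a rough illustration of the two ideas described above, the Python sketch below shows (i) an input-activation-weighted SVD that factorizes a weight matrix under a diagonal scaling derived from calibration activations, and (ii) a gradient-free importance score for FFN intermediate channels. The particular scaling statistic (mean absolute activation) and channel score (product of activation-weighted up-projection and down-projection norms) are illustrative assumptions for this sketch, not necessarily the paper's exact formulas.

import numpy as np

def activation_weighted_svd(W, X, rank):
    """Factor W (d_out x d_in) as A @ B using an SVD weighted by the
    per-channel magnitude of calibration inputs X (n_tokens x d_in)."""
    # Diagonal scaling from the average absolute input activation per channel
    # (assumption: a simple mean-|x| statistic; eps avoids division by zero).
    s = np.abs(X).mean(axis=0) + 1e-6            # (d_in,)
    U, sigma, Vt = np.linalg.svd(W * s, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :rank], sigma[:rank], Vt[:rank, :]
    A = U_k * np.sqrt(s_k)                       # (d_out, rank)
    B = (np.sqrt(s_k)[:, None] * Vt_k) / s       # (rank, d_in), undo scaling
    return A, B                                  # W is approximated by A @ B

def ffn_channel_scores(W_up, W_down, X):
    """Gradient-free importance of each intermediate FFN channel, scored by
    the activation-weighted norm of its up-projection row times the norm of
    its down-projection column (an assumed, illustrative criterion)."""
    s = np.abs(X).mean(axis=0)                   # (d_in,) input statistics
    up_norm = np.linalg.norm(W_up * s, axis=1)   # per-channel rows of W_up
    down_norm = np.linalg.norm(W_down, axis=0)   # per-channel cols of W_down
    return up_norm * down_norm                   # (d_ffn,) channel scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128))           # toy weight matrix
    X = rng.standard_normal((256, 128))          # toy calibration activations
    A, B = activation_weighted_svd(W, X, rank=32)
    print("relative reconstruction error:",
          np.linalg.norm(W - A @ B) / np.linalg.norm(W))

In line with the paper's finding, a pruning policy built on such scores would keep the highest-scoring channels while additionally retaining the least important 1% of parameters, which the authors report are vital for model performance.
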

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-li24bi,
  title     = {{L}o{RAP}: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models},
  author    = {Li, Guangyan and Tang, Yongqiang and Zhang, Wensheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {28657--28672},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/li24bi/li24bi.pdf},
  url       = {https://proceedings.mlr.press/v235/li24bi.html},
  abstract  = {Large language models (LLMs) show excellent performance on difficult tasks, but they often require massive memory and computational resources. How to reduce the parameter scale of LLMs has become a research hotspot. In this study, we make an important observation: the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. In light of this, we design a novel structured compression method, LoRAP, which organically combines Low-Rank matrix approximation And structured Pruning. For the MHA sub-layer, we propose an input-activation-weighted singular value decomposition method and allocate a different amount of parameters to each weight matrix according to the differences in the matrices' low-rank properties. For the FFN sub-layer, we propose a gradient-free structured channel pruning method and retain the least important 1% of parameters, which in fact play a vital role in model performance. Extensive evaluations of zero-shot perplexity and zero-shot task classification indicate that our proposal is superior to previous structured compression rivals under multiple compression ratios. Our code will be released soon.}
}
Endnote
%0 Conference Paper
%T LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models
%A Guangyan Li
%A Yongqiang Tang
%A Wensheng Zhang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-li24bi
%I PMLR
%P 28657--28672
%U https://proceedings.mlr.press/v235/li24bi.html
%V 235
%X Large language models (LLMs) show excellent performance on difficult tasks, but they often require massive memory and computational resources. How to reduce the parameter scale of LLMs has become a research hotspot. In this study, we make an important observation: the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. In light of this, we design a novel structured compression method, LoRAP, which organically combines Low-Rank matrix approximation And structured Pruning. For the MHA sub-layer, we propose an input-activation-weighted singular value decomposition method and allocate a different amount of parameters to each weight matrix according to the differences in the matrices' low-rank properties. For the FFN sub-layer, we propose a gradient-free structured channel pruning method and retain the least important 1% of parameters, which in fact play a vital role in model performance. Extensive evaluations of zero-shot perplexity and zero-shot task classification indicate that our proposal is superior to previous structured compression rivals under multiple compression ratios. Our code will be released soon.
APA
Li, G., Tang, Y. & Zhang, W. (2024). LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:28657-28672. Available from https://proceedings.mlr.press/v235/li24bi.html.