Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:46121-46135, 2024.

Abstract

With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. $\textbf{P}$arameter-$\textbf{E}$fficient $\textbf{F}$ine-$\textbf{T}$uning(PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and been widely applied, the underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of prior distribution which leads to a tighter bound for generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and the quasi-sparsity in gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named $\textbf{S}$parse $\textbf{I}$ncrement $\textbf{F}$ine-$\textbf{T}$uning(SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-song24e, title = {Sparse is Enough in Fine-tuning Pre-trained Large Language Models}, author = {Song, Weixi and Li, Zuchao and Zhang, Lefei and Zhao, Hai and Du, Bo}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {46121--46135}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/song24e/song24e.pdf}, url = {https://proceedings.mlr.press/v235/song24e.html}, abstract = {With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. $\textbf{P}$arameter-$\textbf{E}$fficient $\textbf{F}$ine-$\textbf{T}$uning(PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and been widely applied, the underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of prior distribution which leads to a tighter bound for generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and the quasi-sparsity in gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named $\textbf{S}$parse $\textbf{I}$ncrement $\textbf{F}$ine-$\textbf{T}$uning(SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.} }
Endnote
%0 Conference Paper %T Sparse is Enough in Fine-tuning Pre-trained Large Language Models %A Weixi Song %A Zuchao Li %A Lefei Zhang %A Hai Zhao %A Bo Du %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-song24e %I PMLR %P 46121--46135 %U https://proceedings.mlr.press/v235/song24e.html %V 235 %X With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. $\textbf{P}$arameter-$\textbf{E}$fficient $\textbf{F}$ine-$\textbf{T}$uning(PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and been widely applied, the underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of prior distribution which leads to a tighter bound for generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and the quasi-sparsity in gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named $\textbf{S}$parse $\textbf{I}$ncrement $\textbf{F}$ine-$\textbf{T}$uning(SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.
APA
Song, W., Li, Z., Zhang, L., Zhao, H. & Du, B.. (2024). Sparse is Enough in Fine-tuning Pre-trained Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:46121-46135 Available from https://proceedings.mlr.press/v235/song24e.html.

Related Material