Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Weixi Song; Zuchao Li; Lefei Zhang; Hai Zhao; Bo Du

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:46121-46135, 2024.

Abstract

With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. $\textbf{P}$arameter-$\textbf{E}$fficient $\textbf{F}$ine-$\textbf{T}$uning(PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and been widely applied, the underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of prior distribution which leads to a tighter bound for generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and the quasi-sparsity in gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named $\textbf{S}$parse $\textbf{I}$ncrement $\textbf{F}$ine-$\textbf{T}$uning(SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.
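
The abstract describes a gradient-based sparse fine-tuning algorithm but gives no procedural detail. As a rough illustration of the general idea only, the sketch below applies a plain SGD step to just the parameters whose gradient magnitude is largest and leaves the rest untouched. The function name `sparse_update`, the `density` parameter, and the top-k selection rule are all assumptions made for this example; they are not taken from the paper or from the SIFT repository, which may select and update parameters differently.

```python
from typing import List


def sparse_update(params: List[float], grads: List[float],
                  lr: float, density: float) -> List[float]:
    """Hypothetical sketch: SGD step on only the largest-|gradient| entries.

    `density` is the assumed fraction of parameters allowed to change.
    """
    if len(params) != len(grads):
        raise ValueError("params and grads must have the same length")
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")

    # Number of entries to update (at least one).
    k = max(1, int(len(params) * density))

    # Indices of the k gradient components with the largest magnitude.
    top = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)[:k]

    # Copy the parameters and update only the selected entries.
    updated = list(params)
    for i in top:
        updated[i] -= lr * grads[i]
    return updated


params = [1.0, 2.0, 3.0, 4.0]
grads = [0.01, -5.0, 0.02, 3.0]
new_params = sparse_update(params, grads, lr=0.5, density=0.5)
print(new_params)
```

With `density=0.5`, only the two entries with the largest gradient magnitude change; the other two keep their pre-trained values, which is the sense in which the increment is sparse.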

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-song24e,
  title = 	 {Sparse is Enough in Fine-tuning Pre-trained Large Language Models},
  author =       {Song, Weixi and Li, Zuchao and Zhang, Lefei and Zhao, Hai and Du, Bo},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {46121--46135},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/song24e/song24e.pdf},
  url = 	 {https://proceedings.mlr.press/v235/song24e.html},
  abstract = 	 {With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. $\textbf{P}$arameter-$\textbf{E}$fficient $\textbf{F}$ine-$\textbf{T}$uning(PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and been widely applied, the underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of prior distribution which leads to a tighter bound for generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and the quasi-sparsity in gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named $\textbf{S}$parse $\textbf{I}$ncrement $\textbf{F}$ine-$\textbf{T}$uning(SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.}
}

Endnote

%0 Conference Paper
%T Sparse is Enough in Fine-tuning Pre-trained Large Language Models
%A Weixi Song
%A Zuchao Li
%A Lefei Zhang
%A Hai Zhao
%A Bo Du
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp	
%F pmlr-v235-song24e
%I PMLR
%P 46121--46135
%U https://proceedings.mlr.press/v235/song24e.html
%V 235
%X With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. $\textbf{P}$arameter-$\textbf{E}$fficient $\textbf{F}$ine-$\textbf{T}$uning(PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and been widely applied, the underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of prior distribution which leads to a tighter bound for generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and the quasi-sparsity in gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named $\textbf{S}$parse $\textbf{I}$ncrement $\textbf{F}$ine-$\textbf{T}$uning(SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.

APA


Song, W., Li, Z., Zhang, L., Zhao, H. & Du, B. (2024). Sparse is Enough in Fine-tuning Pre-trained Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:46121-46135. Available from https://proceedings.mlr.press/v235/song24e.html.
