Low-rank fine-tuning lies between lazy training and feature learning

Arif Kerem Dayi, Sitan Chen
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:1415-1471, 2025.

Abstract

LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, mathematically it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, we show that the training dynamics are genuinely distinct from both the lazy linearized dynamics of the kernel regime, and the rich feature learning dynamics captured by GLM regression. We prove under mild assumptions that a student model which is initialized at the base model and trained with online SGD will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation’s Hermite expansion. We also prove that in our setting, learning the teacher model “from scratch” can require significantly more iterations.
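The setting described above can be sketched in code. The snippet below is an illustrative toy version, not the paper's exact algorithm or scalings: a two-layer base model $f(x) = c^\top \sigma(Wx)$ with $k$ neurons, a teacher obtained by adding a rank-1 perturbation $uv^\top$ whose norm is comparable to $\|W\|$, and a student that keeps $W$ frozen and trains only the rank-1 factors $(a, b)$ with online SGD on fresh Gaussian samples. All concrete names and values (`d`, `k`, the tanh activation, `lr`, the step count) are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5                                   # input dimension, number of neurons

W = rng.standard_normal((k, d)) / np.sqrt(d)   # pre-trained (base) first-layer weights
c = rng.standard_normal(k) / np.sqrt(k)        # fixed second-layer weights

# Teacher: base weights plus a rank-1 perturbation u v^T comparable in norm to W.
u = rng.standard_normal(k); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v /= np.linalg.norm(v)
A_star = np.linalg.norm(W) * np.outer(u, v)

def f(x, A):
    """Two-layer network with tanh activation and perturbed first layer."""
    return c @ np.tanh((W + A) @ x)

def grads(x, a, b):
    """Gradients of the squared loss 0.5*(student - teacher)^2 w.r.t. a and b."""
    pre = (W + np.outer(a, b)) @ x
    err = c @ np.tanh(pre) - f(x, A_star)      # student output minus teacher label
    g_pre = err * c * (1.0 - np.tanh(pre) ** 2)
    return g_pre * (b @ x), (g_pre @ a) * x    # chain rule through (a b^T) x

# Student: initialized at the base model (rank-1 factors near zero), online SGD.
a = 0.01 * rng.standard_normal(k)
b = 0.01 * rng.standard_normal(d)
lr = 0.05
for _ in range(30_000):
    x = rng.standard_normal(d)                 # fresh i.i.d. Gaussian sample per step
    ga, gb = grads(x, a, b)
    a -= lr * ga
    b -= lr * gb
```

Note the key structural point: only the $k + d$ parameters of the rank-1 factors are updated, while the $kd$ base weights stay frozen, mirroring how LoRA trains a low-rank perturbation on top of a pre-trained model.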

Cite this Paper


BibTeX
@InProceedings{pmlr-v291-dayi25a,
  title     = {Low-rank fine-tuning lies between lazy training and feature learning},
  author    = {Dayi, Arif Kerem and Chen, Sitan},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages     = {1415--1471},
  year      = {2025},
  editor    = {Haghtalab, Nika and Moitra, Ankur},
  volume    = {291},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--04 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/dayi25a/dayi25a.pdf},
  url       = {https://proceedings.mlr.press/v291/dayi25a.html},
  abstract  = {LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, mathematically it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, we show that the training dynamics are genuinely distinct from both the lazy linearized dynamics of the kernel regime, and the rich feature learning dynamics captured by GLM regression. We prove under mild assumptions that a student model which is initialized at the base model and trained with online SGD will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation’s Hermite expansion. We also prove that in our setting, learning the teacher model “from scratch” can require significantly more iterations.}
}
Endnote
%0 Conference Paper
%T Low-rank fine-tuning lies between lazy training and feature learning
%A Arif Kerem Dayi
%A Sitan Chen
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-dayi25a
%I PMLR
%P 1415--1471
%U https://proceedings.mlr.press/v291/dayi25a.html
%V 291
%X LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, mathematically it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, we show that the training dynamics are genuinely distinct from both the lazy linearized dynamics of the kernel regime, and the rich feature learning dynamics captured by GLM regression. We prove under mild assumptions that a student model which is initialized at the base model and trained with online SGD will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation’s Hermite expansion. We also prove that in our setting, learning the teacher model “from scratch” can require significantly more iterations.
APA
Dayi, A. K., & Chen, S. (2025). Low-rank fine-tuning lies between lazy training and feature learning. Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:1415-1471. Available from https://proceedings.mlr.press/v291/dayi25a.html.
