TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Felipe Pinto Coelho Nuti; Tim Franzmeyer; Joao F. Henriques

TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Felipe Pinto Coelho Nuti, Tim Franzmeyer, Joao F. Henriques

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:46837-46876, 2025.

Abstract

Past work has studied the effects of fine-tuning on large language models’ (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model’s intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice-versa.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-nuti25a,
  title = 	 {{T}u{C}o: Measuring the Contribution of Fine-Tuning to Individual Responses of {LLM}s},
  author =       {Nuti, Felipe Pinto Coelho and Franzmeyer, Tim and Henriques, Joao F.},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {46837--46876},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nuti25a/nuti25a.pdf},
  url = 	 {https://proceedings.mlr.press/v267/nuti25a.html},
  abstract = 	 {Past work has studied the effects of fine-tuning on large language models’ (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model’s intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice-versa.}
}

Endnote

%0 Conference Paper
%T TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs
%A Felipe Pinto Coelho Nuti
%A Tim Franzmeyer
%A Joao F. Henriques
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu	
%F pmlr-v267-nuti25a
%I PMLR
%P 46837--46876
%U https://proceedings.mlr.press/v267/nuti25a.html
%V 267
%X Past work has studied the effects of fine-tuning on large language models’ (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model’s intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice-versa.

APA

Nuti, F.P.C., Franzmeyer, T. & Henriques, J.F.. (2025). TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:46837-46876 Available from https://proceedings.mlr.press/v267/nuti25a.html.

TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Abstract

Cite this Paper

Related Material