Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection

Louis Béthune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, Pierre Ablin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:4020-4042, 2025.

Abstract

A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-bethune25a,
  title     = {Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection},
  author    = {B\'{e}thune, Louis and Grangier, David and Busbridge, Dan and Gualdoni, Eleonora and Cuturi, Marco and Ablin, Pierre},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {4020--4042},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/bethune25a/bethune25a.pdf},
  url       = {https://proceedings.mlr.press/v267/bethune25a.html},
  abstract  = {A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as $1\%$ of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.}
}
Endnote
%0 Conference Paper
%T Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
%A Louis Béthune
%A David Grangier
%A Dan Busbridge
%A Eleonora Gualdoni
%A Marco Cuturi
%A Pierre Ablin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-bethune25a
%I PMLR
%P 4020--4042
%U https://proceedings.mlr.press/v267/bethune25a.html
%V 267
%X A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
APA
Béthune, L., Grangier, D., Busbridge, D., Gualdoni, E., Cuturi, M. & Ablin, P. (2025). Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:4020-4042. Available from https://proceedings.mlr.press/v267/bethune25a.html.