Overtrained Language Models Are Harder to Fine-Tune

Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:56719-56789, 2025.

Abstract

Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.
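The abstract attributes catastrophic overtraining to a growing sensitivity of the pre-trained parameters to any subsequent modification. As an informal illustration only, and not the paper's experimental protocol, the sketch below probes that kind of sensitivity by adding Gaussian noise to a toy model's weights and measuring how the loss degrades; the model, data, and noise scales are all placeholder assumptions.

```python
# Minimal sketch (illustrative, not the paper's method): estimate how much a
# model's loss degrades when every parameter is perturbed by Gaussian noise.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "pre-trained" model: a tiny MLP on synthetic regression data.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(256, 32), torch.randn(256, 1)
loss_fn = nn.MSELoss()

@torch.no_grad()
def loss_under_noise(model, sigma, n_trials=20):
    """Mean loss after perturbing every parameter with noise of scale sigma."""
    base = [p.clone() for p in model.parameters()]
    losses = []
    for _ in range(n_trials):
        for p, b in zip(model.parameters(), base):
            p.copy_(b + sigma * torch.randn_like(b))
        losses.append(loss_fn(model(x), y).item())
    # Restore the original (unperturbed) parameters.
    for p, b in zip(model.parameters(), base):
        p.copy_(b)
    return sum(losses) / len(losses)

with torch.no_grad():
    clean = loss_fn(model(x), y).item()

# A model whose loss rises sharply with sigma is, by this crude proxy, more
# sensitive to parameter modifications.
for sigma in (0.01, 0.03, 0.1):
    print(f"sigma={sigma:.2f}  perturbed loss={loss_under_noise(model, sigma):.4f}  clean loss={clean:.4f}")
```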

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-springer25a,
  title     = {Overtrained Language Models Are Harder to Fine-Tune},
  author    = {Springer, Jacob Mitchell and Goyal, Sachin and Wen, Kaiyue and Kumar, Tanishq and Yue, Xiang and Malladi, Sadhika and Neubig, Graham and Raghunathan, Aditi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {56719--56789},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/springer25a/springer25a.pdf},
  url       = {https://proceedings.mlr.press/v267/springer25a.html},
  abstract  = {Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.}
}
Endnote
%0 Conference Paper
%T Overtrained Language Models Are Harder to Fine-Tune
%A Jacob Mitchell Springer
%A Sachin Goyal
%A Kaiyue Wen
%A Tanishq Kumar
%A Xiang Yue
%A Sadhika Malladi
%A Graham Neubig
%A Aditi Raghunathan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-springer25a
%I PMLR
%P 56719--56789
%U https://proceedings.mlr.press/v267/springer25a.html
%V 267
%X Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.
APA
Springer, J.M., Goyal, S., Wen, K., Kumar, T., Yue, X., Malladi, S., Neubig, G. & Raghunathan, A. (2025). Overtrained Language Models Are Harder to Fine-Tune. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:56719-56789. Available from https://proceedings.mlr.press/v267/springer25a.html.