LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:43302-43330, 2025.

Abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
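To make the notion of a loss-to-loss scaling law concrete, here is a minimal sketch of fitting one relation of this kind: a curve mapping pretraining (train) loss to a downstream (test) loss across models trained on the same data. The shifted power-law form, the scipy-based fit, and the synthetic loss values are illustrative assumptions, not the paper's exact fitting procedure.

```python
# Minimal sketch: fit a loss-to-loss relation between pretraining loss and a
# downstream loss. The functional form and the data below are assumptions made
# for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def loss_to_loss(train_loss, k, kappa, offset):
    """Shifted power law: downstream_loss ~ k * train_loss**kappa + offset."""
    return k * train_loss**kappa + offset

# Hypothetical (train loss, downstream loss) pairs from models of varying size
# and compute, all trained on the same pretraining dataset.
train_losses = np.array([3.2, 2.9, 2.7, 2.5, 2.35, 2.2])
downstream_losses = np.array([4.1, 3.7, 3.45, 3.2, 3.05, 2.9])

params, _ = curve_fit(loss_to_loss, train_losses, downstream_losses,
                      p0=[1.0, 1.0, 0.0])
k, kappa, offset = params
print(f"fit: downstream ~ {k:.2f} * train**{kappa:.2f} + {offset:.2f}")

# Per the paper's finding, refitting on runs from a different pretraining
# dataset would typically shift this curve, whereas changing architecture,
# tokenizer, or optimizer (with the data held fixed) would leave it largely
# unchanged.
```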

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-mayilvahanan25a,
  title     = {{LLM}s on the Line: Data Determines Loss-to-Loss Scaling Laws},
  author    = {Mayilvahanan, Prasanna and Wiedemer, Thadd\"{a}us and Mallick, Sayak and Bethge, Matthias and Brendel, Wieland},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {43302--43330},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mayilvahanan25a/mayilvahanan25a.pdf},
  url       = {https://proceedings.mlr.press/v267/mayilvahanan25a.html},
  abstract  = {Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.}
}
Endnote
%0 Conference Paper
%T LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
%A Prasanna Mayilvahanan
%A Thaddäus Wiedemer
%A Sayak Mallick
%A Matthias Bethge
%A Wieland Brendel
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mayilvahanan25a
%I PMLR
%P 43302--43330
%U https://proceedings.mlr.press/v267/mayilvahanan25a.html
%V 267
%X Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
APA
Mayilvahanan, P., Wiedemer, T., Mallick, S., Bethge, M., & Brendel, W. (2025). LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:43302-43330. Available from https://proceedings.mlr.press/v267/mayilvahanan25a.html.
