DataDecide: How to Predict Best Pretraining Data with Small Experiments

Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:42487-42502, 2025.

Abstract

Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide—the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
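The following is a minimal sketch (not the authors' released code) of the single-scale decision rule the abstract describes: rank pretraining corpora by a benchmark score observed at a small model size, then count how often that ranking agrees with the pairwise outcomes observed at the target scale. The function name and the score dictionaries are illustrative assumptions.

    # Sketch of evaluating a small-scale ranking as a predictor of
    # large-scale pairwise outcomes; values below are placeholders.
    from itertools import combinations

    def pairwise_decision_accuracy(small_scores: dict, large_scores: dict) -> float:
        """Fraction of corpus pairs whose small-scale ordering matches the
        ordering observed at the target (large) scale."""
        corpora = sorted(set(small_scores) & set(large_scores))
        correct, total = 0, 0
        for a, b in combinations(corpora, 2):
            small_diff = small_scores[a] - small_scores[b]
            large_diff = large_scores[a] - large_scores[b]
            if small_diff == 0 or large_diff == 0:
                continue  # ties give no decision signal
            total += 1
            if (small_diff > 0) == (large_diff > 0):
                correct += 1
        return correct / total if total else float("nan")

    # Hypothetical benchmark scores for three corpora at 150M and 1B parameters.
    small = {"corpus_A": 0.41, "corpus_B": 0.38, "corpus_C": 0.44}
    large = {"corpus_A": 0.55, "corpus_B": 0.49, "corpus_C": 0.58}
    print(f"decision accuracy: {pairwise_decision_accuracy(small, large):.0%}")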

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-magnusson25a,
  title     = {{D}ata{D}ecide: How to Predict Best Pretraining Data with Small Experiments},
  author    = {Magnusson, Ian and Tai, Nguyen and Bogin, Ben and Heineman, David and Hwang, Jena D. and Soldaini, Luca and Bhagia, Akshita and Liu, Jiacheng and Groeneveld, Dirk and Tafjord, Oyvind and Smith, Noah A. and Koh, Pang Wei and Dodge, Jesse},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {42487--42502},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/magnusson25a/magnusson25a.pdf},
  url       = {https://proceedings.mlr.press/v267/magnusson25a.html},
  abstract  = {Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide—the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) ($\sim$80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval $>$80% predictable at the target 1B scale with just 0.01% of the compute.}
}
Endnote
%0 Conference Paper
%T DataDecide: How to Predict Best Pretraining Data with Small Experiments
%A Ian Magnusson
%A Nguyen Tai
%A Ben Bogin
%A David Heineman
%A Jena D. Hwang
%A Luca Soldaini
%A Akshita Bhagia
%A Jiacheng Liu
%A Dirk Groeneveld
%A Oyvind Tafjord
%A Noah A. Smith
%A Pang Wei Koh
%A Jesse Dodge
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-magnusson25a
%I PMLR
%P 42487--42502
%U https://proceedings.mlr.press/v267/magnusson25a.html
%V 267
%X Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide—the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
APA
Magnusson, I., Tai, N., Bogin, B., Heineman, D., Hwang, J.D., Soldaini, L., Bhagia, A., Liu, J., Groeneveld, D., Tafjord, O., Smith, N.A., Koh, P.W. & Dodge, J. (2025). DataDecide: How to Predict Best Pretraining Data with Small Experiments. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:42487-42502. Available from https://proceedings.mlr.press/v267/magnusson25a.html.
