Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:79281-79327, 2025.

Abstract

Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these fixed rules lack the flexibility to address the unique characteristics of individual examples, while crafting sample-wise rules is impractical for human experts. In this paper, we show that even small language models, with only 0.3B parameters, can exhibit substantial data-refining capabilities. We propose Programming Every Example (ProX), a novel framework that treats data refinement as a programming task and enables the model to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experiments show that models trained on ProX-refined data consistently outperform baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). ProX also shows great potential in continual pre-training: in the math domain, ProX boosts 7B models by up to 20% within 10B tokens, results typically achieved only with much larger-scale training (e.g., 200B tokens). We believe ProX offers a way to curate high-quality pre-training data and ultimately contributes to more efficient LLM development.
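
To make the refinement-as-programming idea concrete, below is a minimal Python sketch of the generate-then-execute loop the abstract describes. It is an illustration under assumptions, not the authors' implementation: the operation names (drop_doc, remove_lines, normalize), their signatures, and the stubbed fake_refining_model are hypothetical stand-ins; ProX's actual operation set and prompting are defined in the paper.

import re
from typing import Optional

def fake_refining_model(doc: str) -> str:
    """Stand-in for a small (~0.3B) refining LM; returns a tiny per-document program as text."""
    # A real system would prompt the refining model with `doc` here (hypothetical stub).
    return 'remove_lines(start=0, end=0)\nnormalize("Click here to subscribe!", "")'

def execute_program(doc: str, program: str) -> Optional[str]:
    """Execute the generated operations; returning None means the document is dropped."""
    lines = doc.splitlines()
    for op in program.splitlines():
        op = op.strip()
        if not op:
            continue
        if op == "drop_doc()":                                    # discard the whole document
            return None
        if m := re.fullmatch(r"remove_lines\(start=(\d+), end=(\d+)\)", op):
            start, end = int(m.group(1)), int(m.group(2))         # delete noisy lines start..end
            lines = lines[:start] + lines[end + 1:]
        elif m := re.fullmatch(r'normalize\("(.*)", "(.*)"\)', op):
            old, new = m.group(1), m.group(2)                     # string normalization
            lines = [line.replace(old, new) for line in lines]
    return "\n".join(lines)

doc = "Menu | Home | Login\nPython is a general-purpose language.\nClick here to subscribe!"
refined = execute_program(doc, fake_refining_model(doc))
print(refined)  # prints the middle sentence; the boilerplate header and footer are removed

The point of the sketch is the division of labor suggested by the abstract: a small model decides what to fix for each document, and cheap, deterministic program execution applies the fix, which is what makes per-example refinement feasible at corpus scale.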

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhou25aa,
  title     = {Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale},
  author    = {Zhou, Fan and Wang, Zengzhi and Liu, Qian and Li, Junlong and Liu, Pengfei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {79281--79327},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhou25aa/zhou25aa.pdf},
  url       = {https://proceedings.mlr.press/v267/zhou25aa.html},
  abstract  = {Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these fixed rules lack the flexibility to address the unique characteristics of individual examples, while crafting sample-wise rules is impractical for human experts. In this paper, we show that even small language models, with only 0.3B parameters, can exhibit substantial data-refining capabilities. We propose Programming Every Example (ProX), a novel framework that treats data refinement as a programming task and enables the model to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experiments show that models trained on ProX-refined data consistently outperform baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). ProX also shows great potential in continual pre-training: in the math domain, ProX boosts 7B models by up to 20% within 10B tokens, results typically achieved only with much larger-scale training (e.g., 200B tokens). We believe ProX offers a way to curate high-quality pre-training data and ultimately contributes to more efficient LLM development.}
}
Endnote
%0 Conference Paper
%T Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
%A Fan Zhou
%A Zengzhi Wang
%A Qian Liu
%A Junlong Li
%A Pengfei Liu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhou25aa
%I PMLR
%P 79281--79327
%U https://proceedings.mlr.press/v267/zhou25aa.html
%V 267
%X Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these fixed rules lack the flexibility to address the unique characteristics of individual examples, while crafting sample-wise rules is impractical for human experts. In this paper, we show that even small language models, with only 0.3B parameters, can exhibit substantial data-refining capabilities. We propose Programming Every Example (ProX), a novel framework that treats data refinement as a programming task and enables the model to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experiments show that models trained on ProX-refined data consistently outperform baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). ProX also shows great potential in continual pre-training: in the math domain, ProX boosts 7B models by up to 20% within 10B tokens, results typically achieved only with much larger-scale training (e.g., 200B tokens). We believe ProX offers a way to curate high-quality pre-training data and ultimately contributes to more efficient LLM development.
APA
Zhou, F., Wang, Z., Liu, Q., Li, J. & Liu, P. (2025). Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:79281-79327. Available from https://proceedings.mlr.press/v267/zhou25aa.html.