Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:1833-1863, 2025.

Abstract

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
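
The abstract describes DP as a dynamic loop: at each round, generate synthetic data informed by the learner's current state, favor the most informative (hardest) examples, and train on them, rather than generating one large static dataset and pruning it afterward. Below is a minimal sketch of such a loop in Python/PyTorch. It is not the paper's implementation: generate_candidates is a hypothetical stand-in for the paper's text-to-image diffusion model, prediction entropy is one plausible informativeness score, and whereas the paper approximates generating informative examples directly, this sketch over-generates per round and keeps the highest-entropy candidates.

    # Minimal sketch of a deliberate-practice-style training loop.
    # Stand-ins: a random-tensor "generator" replaces the diffusion model,
    # and a linear probe replaces the real learner, to keep it runnable.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    num_classes, dim = 10, 64
    learner = torch.nn.Linear(dim, num_classes)   # toy stand-in learner
    opt = torch.optim.SGD(learner.parameters(), lr=0.1)

    def generate_candidates(labels):
        # Hypothetical helper: stands in for sampling a diffusion model
        # conditioned on class labels.
        return torch.randn(len(labels), dim)

    def difficulty(x):
        # Entropy of the learner's predictions: high entropy means the
        # learner is uncertain, i.e. the sample is informative for it now.
        with torch.no_grad():
            p = F.softmax(learner(x), dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(-1)

    for round_ in range(5):                       # dynamic generation rounds
        labels = torch.randint(num_classes, (512,))
        x = generate_candidates(labels)
        keep = difficulty(x).topk(128).indices    # keep only the hardest samples
        x, labels = x[keep], labels[keep]
        for _ in range(10):                       # train on the selected examples
            loss = F.cross_entropy(learner(x), labels)
            opt.zero_grad(); loss.backward(); opt.step()
        print(f"round {round_}: loss {loss.item():.3f}")

The point of selecting per round rather than pruning one big dataset up front is adaptivity: as the learner improves, the entropy score shifts, so each round targets whatever the learner currently finds hard.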

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-askari-hemmat25a,
  title     = {Improving the Scaling Laws of Synthetic Data with Deliberate Practice},
  author    = {Askari-Hemmat, Reyhane and Pezeshki, Mohammad and Dohmatob, Elvis and Bordes, Florian and Astolfi, Pietro and Hall, Melissa and Verbeek, Jakob and Drozdzal, Michal and Romero-Soriano, Adriana},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {1833--1863},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/askari-hemmat25a/askari-hemmat25a.pdf},
  url       = {https://proceedings.mlr.press/v267/askari-hemmat25a.html},
  abstract  = {Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.}
}
Endnote
%0 Conference Paper
%T Improving the Scaling Laws of Synthetic Data with Deliberate Practice
%A Reyhane Askari-Hemmat
%A Mohammad Pezeshki
%A Elvis Dohmatob
%A Florian Bordes
%A Pietro Astolfi
%A Melissa Hall
%A Jakob Verbeek
%A Michal Drozdzal
%A Adriana Romero-Soriano
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-askari-hemmat25a
%I PMLR
%P 1833--1863
%U https://proceedings.mlr.press/v267/askari-hemmat25a.html
%V 267
%X Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
APA
Askari-Hemmat, R., Pezeshki, M., Dohmatob, E., Bordes, F., Astolfi, P., Hall, M., Verbeek, J., Drozdzal, M. & Romero-Soriano, A. (2025). Improving the Scaling Laws of Synthetic Data with Deliberate Practice. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:1833-1863. Available from https://proceedings.mlr.press/v267/askari-hemmat25a.html.