WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Benjamin Feuer, Chinmay Hegde
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:17100-17130, 2025.

Abstract

Language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
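Since the abstract centers on a released chat dataset intended for SFT experiments, a brief illustration of how such a corpus might be inspected can help readers gauge its shape. The following is a minimal sketch only: the Hugging Face repository path and the record/column names ("conversation", "role", "content") are assumptions for illustration, not identifiers confirmed by the paper or its release page.

    # Minimal sketch: previewing a chat-style SFT corpus with the Hugging Face datasets library.
    # The repo path and field names are hypothetical placeholders.
    from datasets import load_dataset

    # Hypothetical path; substitute the actual WildChat-50M repository name.
    ds = load_dataset("example-org/wildchat-50m-subset", split="train", streaming=True)

    for i, row in enumerate(ds):
        # Each record is assumed to hold a list of chat turns with "role"/"content" keys.
        conversation = row.get("conversation", [])
        print(f"--- sample {i}: {len(conversation)} turns ---")
        for turn in conversation:
            print(f"{turn.get('role', '?')}: {turn.get('content', '')[:80]}")
        if i >= 2:  # preview only a few conversations
            break

Streaming mode avoids downloading the full corpus just to inspect a handful of records, which matters for a dataset at this scale.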

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-feuer25a,
  title     = {{W}ild{C}hat-50{M}: A Deep Dive Into the Role of Synthetic Data in Post-Training},
  author    = {Feuer, Benjamin and Hegde, Chinmay},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {17100--17130},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/feuer25a/feuer25a.pdf},
  url       = {https://proceedings.mlr.press/v267/feuer25a.html},
  abstract  = {Language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.}
}
Endnote
%0 Conference Paper
%T WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
%A Benjamin Feuer
%A Chinmay Hegde
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-feuer25a
%I PMLR
%P 17100--17130
%U https://proceedings.mlr.press/v267/feuer25a.html
%V 267
%X Language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
APA
Feuer, B. & Hegde, C. (2025). WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:17100-17130. Available from https://proceedings.mlr.press/v267/feuer25a.html.