Measuring Diversity in Synthetic Datasets

Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:80373-80397, 2025.

Abstract

Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets—an aspect crucial for robust model performance—remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.
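The abstract describes DCScore as casting diversity evaluation of an n-sample dataset as a sample classification task over the mutual relationships among samples. The sketch below illustrates one plausible way such a classification-based score can be instantiated; it is not the authors' reference implementation (see the linked repository for that). The cosine-similarity kernel, the softmax temperature, and the use of the summed self-classification probabilities as the final score are illustrative assumptions.

# Hedged sketch of a classification-style diversity score, assuming sample
# embeddings are already available (e.g. from a sentence encoder). This is an
# illustration of the idea in the abstract, not the official DCScore code.
import numpy as np

def classification_diversity(embeddings: np.ndarray, temperature: float = 1.0) -> float:
    """Treat each of the n samples as its own class and "classify" every
    sample against all n samples via a softmax over pairwise similarities.
    When samples are mutually dissimilar, each sample keeps most probability
    mass for itself, so the summed diagonal approaches n; near-duplicate
    datasets collapse toward 1."""
    # Cosine similarity between every pair of samples (n x n).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Row-wise softmax: row i is a distribution over "which sample is i?".
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Sum of self-classification probabilities, a value in (1, n].
    return float(np.trace(probs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    near_duplicates = rng.normal(size=(1, 64)).repeat(50, axis=0) + 0.01 * rng.normal(size=(50, 64))
    diverse = rng.normal(size=(50, 64))
    print(classification_diversity(near_duplicates, temperature=0.1))  # ~1  (low diversity)
    print(classification_diversity(diverse, temperature=0.1))          # ~50 (high diversity)

Under these assumptions the score depends only on pairwise relationships among samples, which is consistent with the abstract's claim that the evaluation leverages mutual relationships and avoids more expensive dataset-level procedures.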

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhu25ac,
  title     = {Measuring Diversity in Synthetic Datasets},
  author    = {Zhu, Yuchang and Zhang, Huizhe and Wu, Bingzhe and Li, Jintang and Zheng, Zibin and Zhao, Peilin and Chen, Liang and Bian, Yatao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {80373--80397},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhu25ac/zhu25ac.pdf},
  url       = {https://proceedings.mlr.press/v267/zhu25ac.html},
  abstract  = {Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets—an aspect crucial for robust model performance—remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.}
}
Endnote
%0 Conference Paper
%T Measuring Diversity in Synthetic Datasets
%A Yuchang Zhu
%A Huizhe Zhang
%A Bingzhe Wu
%A Jintang Li
%A Zibin Zheng
%A Peilin Zhao
%A Liang Chen
%A Yatao Bian
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhu25ac
%I PMLR
%P 80373--80397
%U https://proceedings.mlr.press/v267/zhu25ac.html
%V 267
%X Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets—an aspect crucial for robust model performance—remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.
APA
Zhu, Y., Zhang, H., Wu, B., Li, J., Zheng, Z., Zhao, P., Chen, L. & Bian, Y. (2025). Measuring Diversity in Synthetic Datasets. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:80373-80397. Available from https://proceedings.mlr.press/v267/zhu25ac.html.
