Synthetic Data, Real Errors: How (Not) to Publish and Use Synthetic Data

Boris Van Breugel, Zhaozhi Qian, Mihaela Van Der Schaar
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:34793-34808, 2023.

Abstract

Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually not perfect, resulting in potential errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive synthetic data approach—using synthetic data as if it is real—leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce Deep Generative Ensemble (DGE)—a framework inspired by Deep Ensembles that aims to implicitly approximate the posterior distribution over the generative process model parameters. DGE improves downstream model training, evaluation, and uncertainty quantification, vastly outperforming the naive approach on average. The largest improvements are achieved for minority classes and low-density regions of the original data, for which the generative uncertainty is largest.
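The core DGE recipe described above — fit several generative models, draw a synthetic dataset from each, train one downstream model per dataset, and average the predictions — can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the "generative models" here are per-class Gaussians fit on bootstrap resamples, and the downstream learner is a nearest-class-mean scorer, both chosen only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: two Gaussian classes in 2-D.
X = np.concatenate([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

def fit_generator(X, y, rng):
    """Fit a per-class Gaussian 'generative model' on a bootstrap resample.

    The bootstrap stands in for the randomness of training one DGE member.
    """
    idx = rng.integers(0, len(X), len(X))
    Xb, yb = X[idx], y[idx]
    return {c: (Xb[yb == c].mean(axis=0), Xb[yb == c].std(axis=0) + 1e-6)
            for c in (0, 1)}

def sample_synthetic(params, n, rng):
    """Draw n synthetic points per class from one fitted generator."""
    Xs = np.concatenate([rng.normal(mu, sd, (n, 2)) for mu, sd in params.values()])
    ys = np.concatenate([np.full(n, c) for c in params])
    return Xs, ys

def fit_classifier(Xs, ys):
    """Downstream model: class means of one synthetic dataset."""
    return {c: Xs[ys == c].mean(axis=0) for c in (0, 1)}

def predict_proba(means, X):
    """Soft score for class 1 from distances to the two class means."""
    d0 = np.linalg.norm(X - means[0], axis=1)
    d1 = np.linalg.norm(X - means[1], axis=1)
    return 1 / (1 + np.exp(d1 - d0))

# DGE-style aggregation: average predictions over k generator/classifier pairs.
k = 5
probs = np.zeros(len(X))
for _ in range(k):
    gen = fit_generator(X, y, rng)
    Xs, ys = sample_synthetic(gen, 200, rng)
    probs += predict_proba(fit_classifier(Xs, ys), X)
probs /= k

acc = ((probs > 0.5) == y).mean()
```

Averaging over the `k` members is what distinguishes this from the naive approach (training on a single synthetic dataset): the spread of the per-member predictions also gives a rough uncertainty signal, which the abstract notes is most valuable in low-density regions.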

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-van-breugel23a,
  title     = {Synthetic Data, Real Errors: How ({N}ot) to Publish and Use Synthetic Data},
  author    = {Van Breugel, Boris and Qian, Zhaozhi and Van Der Schaar, Mihaela},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {34793--34808},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/van-breugel23a/van-breugel23a.pdf},
  url       = {https://proceedings.mlr.press/v202/van-breugel23a.html},
  abstract  = {Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually not perfect, resulting in potential errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive synthetic data approach—using synthetic data as if it is real—leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce Deep Generative Ensemble (DGE)—a framework inspired by Deep Ensembles that aims to implicitly approximate the posterior distribution over the generative process model parameters. DGE improves downstream model training, evaluation, and uncertainty quantification, vastly outperforming the naive approach on average. The largest improvements are achieved for minority classes and low-density regions of the original data, for which the generative uncertainty is largest.}
}
Endnote
%0 Conference Paper
%T Synthetic Data, Real Errors: How (Not) to Publish and Use Synthetic Data
%A Boris Van Breugel
%A Zhaozhi Qian
%A Mihaela Van Der Schaar
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-van-breugel23a
%I PMLR
%P 34793--34808
%U https://proceedings.mlr.press/v202/van-breugel23a.html
%V 202
%X Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually not perfect, resulting in potential errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive synthetic data approach—using synthetic data as if it is real—leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce Deep Generative Ensemble (DGE)—a framework inspired by Deep Ensembles that aims to implicitly approximate the posterior distribution over the generative process model parameters. DGE improves downstream model training, evaluation, and uncertainty quantification, vastly outperforming the naive approach on average. The largest improvements are achieved for minority classes and low-density regions of the original data, for which the generative uncertainty is largest.
APA
Van Breugel, B., Qian, Z. & Van Der Schaar, M. (2023). Synthetic Data, Real Errors: How (Not) to Publish and Use Synthetic Data. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:34793-34808. Available from https://proceedings.mlr.press/v202/van-breugel23a.html.