A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets

Ossi Räisä, Antti Honkela
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:307-315, 2025.

Abstract

Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very light. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice with several real datasets, downstream predictors and error metrics. As our theory predicts, multiple synthetic datasets often improve accuracy, while a single large synthetic dataset gives at best minimal improvement, showing that our insights are practically relevant.
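The mechanism the abstract describes — training a downstream predictor on each of several synthetic datasets and averaging their predictions, so that generator-induced variance is reduced — can be illustrated with a toy sketch. Everything below (the linear-Gaussian stand-in generator, the dataset sizes, the number of datasets m, the repetition count) is an illustrative assumption, not the paper's actual experimental setup or rule of thumb:

```python
import numpy as np

def run_trial(seed, m=10, n=200):
    """One experiment: fit a toy generator to 'real' data, sample m synthetic
    datasets from it, train a predictor on each, and ensemble by averaging."""
    rng = np.random.default_rng(seed)

    # "Real" data from y = 2x + noise
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.5, n)

    # Stand-in generator: a linear-Gaussian model fit to the real data
    # (the paper's setting allows general generators, including DP ones)
    slope, intercept = np.polyfit(x, y, 1)
    resid_sd = np.std(y - (slope * x + intercept))

    x_test = np.linspace(-1.0, 1.0, 50)
    y_true = 2.0 * x_test

    preds = []
    for _ in range(m):
        # Sample one synthetic dataset and fit the downstream predictor
        xs = rng.uniform(-1.0, 1.0, n)
        ys = slope * xs + intercept + rng.normal(0.0, resid_sd, n)
        s, b = np.polyfit(xs, ys, 1)
        preds.append(s * x_test + b)
    preds = np.stack(preds)

    mse_single = np.mean((preds[0] - y_true) ** 2)              # one dataset
    mse_ensemble = np.mean((preds.mean(axis=0) - y_true) ** 2)  # average of m
    return mse_single, mse_ensemble

# Average over repeated experiments so the variance reduction is visible
results = np.array([run_trial(seed) for seed in range(50)])
print("mean MSE, single synthetic dataset: %.4f" % results[:, 0].mean())
print("mean MSE, ensemble over 10 datasets: %.4f" % results[:, 1].mean())
```

In this sketch the ensemble's bias matches the single-dataset predictor's, while the variance contributed by synthetic-data sampling shrinks roughly by the factor m, which is the kind of effect the paper's decompositions make precise.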

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-raisa25a,
  title = {A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets},
  author = {R{\"a}is{\"a}, Ossi and Honkela, Antti},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {307--315},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/raisa25a/raisa25a.pdf},
  url = {https://proceedings.mlr.press/v258/raisa25a.html},
  abstract = {Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very light. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice with several real datasets, downstream predictors and error metrics. As our theory predicts, multiple synthetic datasets often improve accuracy, while a single large synthetic dataset gives at best minimal improvement, showing that our insights are practically relevant.}
}
Endnote
%0 Conference Paper
%T A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets
%A Ossi Räisä
%A Antti Honkela
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-raisa25a
%I PMLR
%P 307--315
%U https://proceedings.mlr.press/v258/raisa25a.html
%V 258
%X Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very light. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice with several real datasets, downstream predictors and error metrics. As our theory predicts, multiple synthetic datasets often improve accuracy, while a single large synthetic dataset gives at best minimal improvement, showing that our insights are practically relevant.
APA
Räisä, O. & Honkela, A. (2025). A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:307-315. Available from https://proceedings.mlr.press/v258/raisa25a.html.