Synthsonic: Fast, Probabilistic modeling and Synthesis of Tabular Data

Max Baak, Simon Brugman, Ilan Fridman Rojas, Lorraine Dalmeida, Ralph E.Q. Urlus, Jean-Baptiste Oger
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:4747-4763, 2022.

Abstract

The creation of realistic, synthetic datasets has several purposes with growing demand in recent times, e.g. privacy protection and other cases where real data cannot be easily shared. A multitude of primarily neural networks (NNs), e.g. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), or Bayesian Network (BN) approaches have been created to tackle this problem, however these require extensive compute resources, lack interpretability, and in some instances lack replication fidelity as well. We propose a hybrid, probabilistic approach for synthesizing pairwise independent tabular data, called Synthsonic. A sequence of well-understood, invertible statistical transformations removes first-order correlations, then a Bayesian Network jointly models continuous and categorical variables, and a calibrated discriminative learner captures the remaining dependencies. Replication studies on MIT’s SDGym benchmark show marginally or significantly better performance than all prior BN-based approaches, while being competitive with NN-based approaches (first place in 10 out of 13 benchmark datasets). The computational time required to learn the data distribution is at least one order of magnitude lower than the NN methods. Furthermore, inspecting intermediate results during the synthetic data generation allows easy diagnostics and tailored corrections. We believe the combination of out-of-the-box performance, speed and interpretability make this method a significant addition to the synthetic data generation
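The last modeling step in the abstract, a calibrated discriminative learner that captures remaining dependencies, can be illustrated with the standard density-ratio trick. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the classifier output is used, so the weighting `p / (1 - p)` and the resampling step are assumptions, and the one-dimensional histogram "classifier" and all function names are hypothetical stand-ins chosen to keep the example dependency-free.

```python
import random
from collections import Counter

def density_ratio_weights(p_real, eps=1e-9):
    """Turn calibrated P(real | x) into weights p / (1 - p).

    With equally many real and base-model samples in training, this ratio
    estimates real_density(x) / base_density(x).
    """
    return [p / max(1.0 - p, eps) for p in p_real]

def histogram_classifier(real, base, n_bins=10):
    """Toy stand-in for a calibrated classifier on values in [0, 1).

    Hypothetical helper: per-bin class frequencies serve as P(real | x).
    """
    def bin_of(x):
        return min(int(x * n_bins), n_bins - 1)

    real_counts = Counter(bin_of(x) for x in real)
    base_counts = Counter(bin_of(x) for x in base)

    def predict(x):
        r, b = real_counts[bin_of(x)], base_counts[bin_of(x)]
        return r / (r + b) if (r + b) else 0.5

    return predict

rng = random.Random(0)
n = 20000

# "Real" data: skewed toward 1 (sqrt of a uniform has density 2x on [0, 1)).
real = [rng.random() ** 0.5 for _ in range(n)]
# Samples from a deliberately imperfect base model: plain uniform.
base = [rng.random() for _ in range(n)]

# Learn to tell real from base, then reweight and resample the base samples.
predict = histogram_classifier(real, base)
weights = density_ratio_weights([predict(x) for x in base])
synthetic = rng.choices(base, weights=weights, k=n)

mean_real = sum(real) / n
mean_base = sum(base) / n
mean_synth = sum(synthetic) / n
print(f"real={mean_real:.3f} base={mean_base:.3f} synthetic={mean_synth:.3f}")
```

In this toy setup the resampled values should track the real sample's mean much more closely than the raw base samples do, which is the sense in which a classifier can correct what the simpler model misses.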

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-baak22a,
  title     = {Synthsonic: Fast, Probabilistic modeling and Synthesis of Tabular Data},
  author    = {Baak, Max and Brugman, Simon and Fridman Rojas, Ilan and Dalmeida, Lorraine and Urlus, Ralph E.Q. and Oger, Jean-Baptiste},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {4747--4763},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/baak22a/baak22a.pdf},
  url       = {https://proceedings.mlr.press/v151/baak22a.html},
  abstract  = {The creation of realistic, synthetic datasets has several purposes with growing demand in recent times, e.g. privacy protection and other cases where real data cannot be easily shared. A multitude of primarily neural networks (NNs), e.g. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), or Bayesian Network (BN) approaches have been created to tackle this problem, however these require extensive compute resources, lack interpretability, and in some instances lack replication fidelity as well. We propose a hybrid, probabilistic approach for synthesizing pairwise independent tabular data, called Synthsonic. A sequence of well-understood, invertible statistical transformations removes first-order correlations, then a Bayesian Network jointly models continuous and categorical variables, and a calibrated discriminative learner captures the remaining dependencies. Replication studies on MIT’s SDGym benchmark show marginally or significantly better performance than all prior BN-based approaches, while being competitive with NN-based approaches (first place in 10 out of 13 benchmark datasets). The computational time required to learn the data distribution is at least one order of magnitude lower than the NN methods. Furthermore, inspecting intermediate results during the synthetic data generation allows easy diagnostics and tailored corrections. We believe the combination of out-of-the-box performance, speed and interpretability make this method a significant addition to the synthetic data generation}
}
Endnote
%0 Conference Paper
%T Synthsonic: Fast, Probabilistic modeling and Synthesis of Tabular Data
%A Max Baak
%A Simon Brugman
%A Ilan Fridman Rojas
%A Lorraine Dalmeida
%A Ralph E.Q. Urlus
%A Jean-Baptiste Oger
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-baak22a
%I PMLR
%P 4747--4763
%U https://proceedings.mlr.press/v151/baak22a.html
%V 151
%X The creation of realistic, synthetic datasets has several purposes with growing demand in recent times, e.g. privacy protection and other cases where real data cannot be easily shared. A multitude of primarily neural networks (NNs), e.g. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), or Bayesian Network (BN) approaches have been created to tackle this problem, however these require extensive compute resources, lack interpretability, and in some instances lack replication fidelity as well. We propose a hybrid, probabilistic approach for synthesizing pairwise independent tabular data, called Synthsonic. A sequence of well-understood, invertible statistical transformations removes first-order correlations, then a Bayesian Network jointly models continuous and categorical variables, and a calibrated discriminative learner captures the remaining dependencies. Replication studies on MIT’s SDGym benchmark show marginally or significantly better performance than all prior BN-based approaches, while being competitive with NN-based approaches (first place in 10 out of 13 benchmark datasets). The computational time required to learn the data distribution is at least one order of magnitude lower than the NN methods. Furthermore, inspecting intermediate results during the synthetic data generation allows easy diagnostics and tailored corrections. We believe the combination of out-of-the-box performance, speed and interpretability make this method a significant addition to the synthetic data generation
APA
Baak, M., Brugman, S., Fridman Rojas, I., Dalmeida, L., Urlus, R. E. Q., & Oger, J.-B. (2022). Synthsonic: Fast, Probabilistic modeling and Synthesis of Tabular Data. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:4747-4763. Available from https://proceedings.mlr.press/v151/baak22a.html.

Related Material