Synthsonic: Fast, Probabilistic modeling and Synthesis of Tabular Data
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:4747-4763, 2022.
Demand for realistic synthetic datasets has grown in recent years, driven by privacy protection and other settings in which real data cannot easily be shared. Many approaches have been proposed to tackle this problem, mostly based on neural networks (NNs), e.g. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), or on Bayesian Networks (BNs). However, these require extensive compute resources, lack interpretability, and in some instances lack replication fidelity as well. We propose a hybrid, probabilistic approach for synthesizing pairwise independent tabular data, called Synthsonic. A sequence of well-understood, invertible statistical transformations removes first-order correlations, then a Bayesian Network jointly models continuous and categorical variables, and a calibrated discriminative learner captures the remaining dependencies. Replication studies on MIT’s SDGym benchmark show marginally or significantly better performance than all prior BN-based approaches, while remaining competitive with NN-based approaches (first place in 10 out of 13 benchmark datasets). The computational time required to learn the data distribution is at least one order of magnitude lower than for the NN methods. Furthermore, inspecting intermediate results during synthetic data generation allows easy diagnostics and tailored corrections. We believe the combination of out-of-the-box performance, speed and interpretability makes this method a significant addition to the synthetic data generation
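
To make the last modeling step more concrete, here is a minimal sketch of how a calibrated classifier can capture dependencies that a simpler density model misses. It is not the authors' implementation. It assumes the correction takes the form of a density-ratio weight c/(1-c), where c is the classifier's probability that a row is real. That is a common technique, but the abstract does not spell out the mechanism. The sketch also skips the invertible transformations and replaces the Bayesian Network with a plain product of marginals. All function names (`fit_marginals`, `base_prob`, `sample_base`, `fit_count_classifier`, `corrected_prob`) are hypothetical.

```python
import random
from collections import Counter


def fit_marginals(rows):
    """Per-column empirical marginals: a stand-in for the base density model."""
    n = len(rows)
    return [
        {value: count / n for value, count in Counter(column).items()}
        for column in zip(*rows)
    ]


def base_prob(marginals, row):
    """Probability of a row under the independence (product-of-marginals) model."""
    p = 1.0
    for marginal, value in zip(marginals, row):
        p *= marginal.get(value, 0.0)
    return p


def sample_base(marginals, n, rng):
    """Draw n rows from the base model, sampling each column independently."""
    columns = [
        rng.choices(list(m.keys()), weights=list(m.values()), k=n)
        for m in marginals
    ]
    return list(zip(*columns))


def fit_count_classifier(real_rows, synth_rows):
    """Count-based estimate of P(real | row), given equally sized real and
    synthetic sets. It stands in for a calibrated discriminative learner."""
    real_counts, synth_counts = Counter(real_rows), Counter(synth_rows)

    def predict(row):
        r, s = real_counts[row], synth_counts[row]
        return r / (r + s) if r + s else 0.5  # unseen row: no evidence either way

    return predict


def corrected_prob(marginals, classifier, row):
    """Base density reweighted by the classifier's density ratio c / (1 - c)."""
    c = min(classifier(row), 1.0 - 1e-9)  # clip to avoid division by zero
    return base_prob(marginals, row) * c / (1.0 - c)


# Toy data: two strongly dependent binary columns.
rng = random.Random(0)
real_rows = [(0, 0)] * 4000 + [(0, 1)] * 1000 + [(1, 0)] * 1000 + [(1, 1)] * 4000
marginals = fit_marginals(real_rows)
synth_rows = sample_base(marginals, len(real_rows), rng)
classifier = fit_count_classifier(real_rows, synth_rows)

for row in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(row, base_prob(marginals, row), corrected_prob(marginals, classifier, row))
```

In this toy setting the independence model assigns probability 0.25 to every cell, while the reweighted density moves toward the true cell frequencies (0.4 and 0.1), up to sampling noise. A real tabular pipeline would need a proper calibrated learner in place of the counting classifier, since exact rows rarely repeat when continuous columns are present.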