Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1288-1296, 2024.

Abstract

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at \url{https://github.com/SamsungSAILMontreal/ForestDiffusion}.
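To make the core idea concrete, here is a minimal, self-contained sketch (an illustration only, not the authors' implementation; see the ForestDiffusion repository above for that): it trains XGBoost regressors to approximate the conditional flow matching vector field on a toy continuous dataset, one regressor per discretized noise level and output dimension, then generates samples by Euler integration from Gaussian noise. The toy dataset, hyperparameters, and duplication factor K are illustrative assumptions.

```python
# Sketch of flow matching with gradient-boosted trees (NOT the authors' exact code).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)

# Toy continuous data: a two-mode Gaussian mixture in 2D (assumption for illustration).
X1 = np.concatenate([rng.normal(-2.0, 0.5, size=(500, 2)),
                     rng.normal( 2.0, 0.5, size=(500, 2))])
n, d = X1.shape

n_t = 20                          # number of discretized time/noise levels
K = 10                            # noise duplicates per data row (illustrative)
t_levels = np.arange(n_t) / n_t   # t_i = i / n_t

# One regressor per (time level, output dimension); each fit is independent,
# so in practice they can all be trained in parallel on CPUs.
models = [[None] * d for _ in range(n_t)]
for i, t in enumerate(t_levels):
    x1 = np.repeat(X1, K, axis=0)          # duplicate data rows
    x0 = rng.standard_normal(x1.shape)     # pair each row with Gaussian noise
    xt = (1.0 - t) * x0 + t * x1           # linear interpolation path
    ut = x1 - x0                           # conditional flow matching target
    for j in range(d):
        m = XGBRegressor(n_estimators=100, max_depth=7, learning_rate=0.3)
        m.fit(xt, ut[:, j])
        models[i][j] = m

def generate(n_samples, seed=1):
    """Euler-integrate the learned vector field from noise (t=0) to data (t=1)."""
    x = np.random.default_rng(seed).standard_normal((n_samples, d))
    dt = 1.0 / n_t
    for i in range(n_t):
        v = np.column_stack([models[i][j].predict(x) for j in range(d)])
        x = x + dt * v
    return x

X_fake = generate(1000)
print(X_fake.mean(axis=0), X_fake.std(axis=0))  # should roughly match the mixture
```

Because each gradient-boosted regressor is fit independently per noise level and per feature, the training loop above parallelizes naturally across CPU cores, which is what lets the approach run without a GPU.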

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-jolicoeur-martineau24a,
  title     = {Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees},
  author    = {Jolicoeur-Martineau, Alexia and Fatras, Kilian and Kachman, Tal},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {1288--1296},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/jolicoeur-martineau24a/jolicoeur-martineau24a.pdf},
  url       = {https://proceedings.mlr.press/v238/jolicoeur-martineau24a.html},
  abstract  = {Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at \url{https://github.com/SamsungSAILMontreal/ForestDiffusion}.}
}
Endnote
%0 Conference Paper
%T Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
%A Alexia Jolicoeur-Martineau
%A Kilian Fatras
%A Tal Kachman
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-jolicoeur-martineau24a
%I PMLR
%P 1288--1296
%U https://proceedings.mlr.press/v238/jolicoeur-martineau24a.html
%V 238
%X Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at \url{https://github.com/SamsungSAILMontreal/ForestDiffusion}.
APA
Jolicoeur-Martineau, A., Fatras, K. & Kachman, T. (2024). Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:1288-1296. Available from https://proceedings.mlr.press/v238/jolicoeur-martineau24a.html.
