Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1288-1296, 2024.

Abstract

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at \url{https://github.com/SamsungSAILMontreal/ForestDiffusion}.
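To make the core idea concrete, below is a minimal sketch (not the authors' ForestDiffusion implementation; see the linked repository for that) of the conditional flow matching variant described in the abstract: one XGBoost regressor per noise level and per feature learns the velocity field of a linear interpolation between Gaussian noise and data, and new samples are drawn by Euler integration. It assumes purely continuous features and a single noise draw per data point for simplicity.

```python
# Minimal sketch of flow matching with per-noise-level XGBoost regressors.
# Simplifying assumptions: continuous features only, one noise draw per sample.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))          # toy continuous data (n samples, d features)
n, d = X.shape
n_t = 20                                   # number of noise levels t in [0, 1]
t_grid = np.linspace(0.0, 1.0, n_t)

# Training: for each t, build linear interpolants x_t = (1 - t) * z + t * x
# between Gaussian noise z and data x, and regress the velocity target x - z.
models = [[None] * d for _ in range(n_t)]
for i, t in enumerate(t_grid):
    z = rng.standard_normal((n, d))
    x_t = (1 - t) * z + t * X
    v = X - z                              # conditional flow-matching target
    for j in range(d):
        m = XGBRegressor(n_estimators=100, max_depth=7, n_jobs=-1)
        m.fit(x_t, v[:, j])                # one regressor per (noise level, feature)
        models[i][j] = m

# Sampling: start from noise and integrate the learned vector field with Euler steps.
x = rng.standard_normal((200, d))
dt = 1.0 / (n_t - 1)
for i in range(n_t - 1):
    v_hat = np.column_stack([models[i][j].predict(x) for j in range(d)])
    x = x + dt * v_hat                     # x at t = 1 approximates new samples
```

Because each (noise level, feature) regressor is an independent XGBoost model, training parallelizes naturally across CPU cores, which is what allows the method to run without a GPU.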

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-jolicoeur-martineau24a,
  title     = {Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees},
  author    = {Jolicoeur-Martineau, Alexia and Fatras, Kilian and Kachman, Tal},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {1288--1296},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/jolicoeur-martineau24a/jolicoeur-martineau24a.pdf},
  url       = {https://proceedings.mlr.press/v238/jolicoeur-martineau24a.html},
  abstract  = {Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at \url{https://github.com/SamsungSAILMontreal/ForestDiffusion}.}
}
Endnote
%0 Conference Paper
%T Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
%A Alexia Jolicoeur-Martineau
%A Kilian Fatras
%A Tal Kachman
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-jolicoeur-martineau24a
%I PMLR
%P 1288--1296
%U https://proceedings.mlr.press/v238/jolicoeur-martineau24a.html
%V 238
%X Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at \url{https://github.com/SamsungSAILMontreal/ForestDiffusion}.
APA
Jolicoeur-Martineau, A., Fatras, K. & Kachman, T. (2024). Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:1288-1296. Available from https://proceedings.mlr.press/v238/jolicoeur-martineau24a.html.