TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data

Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:74858-74881, 2025.

Abstract

While autoregressive models dominate natural language generation, their application to tabular data remains limited due to two challenges: 1) tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data, and 2) tabular data is column permutation-invariant, requiring flexible generation orders. Traditional autoregressive models, with their fixed generation order, struggle with tasks like missing data imputation, where the target and conditioning columns vary. To address these issues, we propose Diffusion-nested Non-autoregressive Transformer (TabNAT), a hybrid model combining diffusion processes and masked generative modeling. For continuous columns, TabNAT uses a diffusion model to parameterize their conditional distributions, while for discrete columns, it employs next-token prediction with KL divergence minimization. A masked Transformer with bi-directional attention enables order-agnostic generation, allowing it to learn the distribution of target columns conditioned on arbitrary observed columns. Extensive experiments on ten datasets with diverse properties demonstrate TabNAT’s superiority in both unconditional tabular data generation and conditional missing data imputation tasks.
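To make the hybrid design in the abstract concrete, below is a minimal PyTorch sketch of the idea: a bidirectional masked Transformer produces a context vector per column; discrete columns get a categorical head trained with cross-entropy (equivalently, KL divergence to the empirical one-hot distribution), and continuous columns get a small noise-prediction (diffusion-style) head conditioned on that context. This is an illustrative toy under stated assumptions, not the authors' implementation: the class name TabNATSketch, all hyperparameters, and the simplified noising schedule are invented for exposition.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TabNATSketch(nn.Module):
    # Hypothetical sketch of the abstract's design; not the paper's code.
    def __init__(self, n_num, cat_cardinalities, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.n_num = n_num
        self.n_cat = len(cat_cardinalities)
        n_cols = n_num + self.n_cat
        # One learnable embedding per column position, plus a shared [MASK] token.
        self.col_emb = nn.Parameter(torch.randn(n_cols, d_model) * 0.02)
        self.mask_tok = nn.Parameter(torch.zeros(d_model))
        self.num_proj = nn.Linear(1, d_model)  # embed each numeric value
        self.cat_embs = nn.ModuleList(nn.Embedding(K, d_model) for K in cat_cardinalities)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)  # bi-directional attention
        # Per-column categorical output heads for the discrete columns.
        self.cat_heads = nn.ModuleList(nn.Linear(d_model, K) for K in cat_cardinalities)
        # Diffusion-style head: predicts the noise added to a numeric value,
        # given the noisy value, the diffusion time, and the Transformer context.
        self.denoiser = nn.Sequential(nn.Linear(d_model + 2, 128), nn.SiLU(), nn.Linear(128, 1))

    def forward(self, x_num, x_cat, mask):
        # x_num: (B, n_num) floats; x_cat: (B, n_cat) long;
        # mask: (B, n_cols) bool, True where a column is hidden (to be generated).
        toks = [self.num_proj(x_num[:, i:i + 1]) for i in range(self.n_num)]
        toks += [emb(x_cat[:, j]) for j, emb in enumerate(self.cat_embs)]
        h = torch.stack(toks, dim=1) + self.col_emb  # (B, n_cols, d_model)
        # Replace hidden columns by the shared mask token (no information leak).
        h = torch.where(mask.unsqueeze(-1), self.mask_tok.expand_as(h), h)
        return self.encoder(h)  # per-column context vectors

    def loss(self, x_num, x_cat, mask):
        ctx = self.forward(x_num, x_cat, mask)
        # Discrete columns: token prediction via cross-entropy, i.e. KL divergence
        # between the one-hot data distribution and the model's softmax.
        cat_loss = sum(
            F.cross_entropy(self.cat_heads[j](ctx[:, self.n_num + j]), x_cat[:, j])
            for j in range(self.n_cat)
        )
        # Continuous columns: DDPM-style noise-prediction objective, conditioned
        # on the column's context vector (toy linear-interpolation schedule).
        t = torch.rand(x_num.shape[0], 1)
        eps = torch.randn_like(x_num)
        x_t = torch.sqrt(1 - t) * x_num + torch.sqrt(t) * eps
        num_loss = 0.0
        for i in range(self.n_num):
            inp = torch.cat([ctx[:, i], x_t[:, i:i + 1], t], dim=-1)
            num_loss = num_loss + F.mse_loss(self.denoiser(inp), eps[:, i:i + 1])
        return cat_loss + num_loss

# Toy usage (shapes only): 3 numeric columns, 2 categorical columns.
model = TabNATSketch(n_num=3, cat_cardinalities=[4, 7])
x_num, x_cat = torch.randn(8, 3), torch.randint(0, 4, (8, 2))
mask = torch.rand(8, 5) < 0.5  # random subset of columns to generate
print(model.loss(x_num, x_cat, mask))

Under this reading, missing data imputation corresponds to masking only the missing columns and sampling the masked positions given the observed ones, while unconditional generation masks every column; the bi-directional attention is what makes both orderings available to a single trained model.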

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-zhang25t,
  title     = {{T}ab{NAT}: A Continuous-Discrete Joint Generative Framework for Tabular Data},
  author    = {Zhang, Hengrui and Fang, Liancheng and Wu, Qitian and Yu, Philip S.},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {74858--74881},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25t/zhang25t.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25t.html},
  abstract  = {While autoregressive models dominate natural language generation, their application to tabular data remains limited due to two challenges: 1) tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data, and 2) tabular data is column permutation-invariant, requiring flexible generation orders. Traditional autoregressive models, with their fixed generation order, struggle with tasks like missing data imputation, where the target and conditioning columns vary. To address these issues, we propose Diffusion-nested Non-autoregressive Transformer (TabNAT), a hybrid model combining diffusion processes and masked generative modeling. For continuous columns, TabNAT uses a diffusion model to parameterize their conditional distributions, while for discrete columns, it employs next-token prediction with KL divergence minimization. A masked Transformer with bi-directional attention enables order-agnostic generation, allowing it to learn the distribution of target columns conditioned on arbitrary observed columns. Extensive experiments on ten datasets with diverse properties demonstrate TabNAT’s superiority in both unconditional tabular data generation and conditional missing data imputation tasks.}
}
Endnote
%0 Conference Paper
%T TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data
%A Hengrui Zhang
%A Liancheng Fang
%A Qitian Wu
%A Philip S. Yu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25t
%I PMLR
%P 74858--74881
%U https://proceedings.mlr.press/v267/zhang25t.html
%V 267
%X While autoregressive models dominate natural language generation, their application to tabular data remains limited due to two challenges: 1) tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data, and 2) tabular data is column permutation-invariant, requiring flexible generation orders. Traditional autoregressive models, with their fixed generation order, struggle with tasks like missing data imputation, where the target and conditioning columns vary. To address these issues, we propose Diffusion-nested Non-autoregressive Transformer (TabNAT), a hybrid model combining diffusion processes and masked generative modeling. For continuous columns, TabNAT uses a diffusion model to parameterize their conditional distributions, while for discrete columns, it employs next-token prediction with KL divergence minimization. A masked Transformer with bi-directional attention enables order-agnostic generation, allowing it to learn the distribution of target columns conditioned on arbitrary observed columns. Extensive experiments on ten datasets with diverse properties demonstrate TabNAT’s superiority in both unconditional tabular data generation and conditional missing data imputation tasks.
APA
Zhang, H., Fang, L., Wu, Q. & Yu, P.S. (2025). TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:74858-74881. Available from https://proceedings.mlr.press/v267/zhang25t.html.