CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

Aditya Gorla, Ryan Wang, Zhengtong Liu, Ulzee An, Sriram Sankararaman
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:20187-20225, 2025.

Abstract

We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features — captured by column names and text descriptions — to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods — an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, at random and completely at random, respectively) — across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gorla25a,
  title     = {{CACTI}: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation},
  author    = {Gorla, Aditya and Wang, Ryan and Liu, Zhengtong and An, Ulzee and Sankararaman, Sriram},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {20187--20225},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gorla25a/gorla25a.pdf},
  url       = {https://proceedings.mlr.press/v267/gorla25a.html},
  abstract  = {We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features — captured by column names and text descriptions — to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods — an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, at random and completely at random, respectively) — across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.}
}
Endnote
%0 Conference Paper
%T CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation
%A Aditya Gorla
%A Ryan Wang
%A Zhengtong Liu
%A Ulzee An
%A Sriram Sankararaman
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gorla25a
%I PMLR
%P 20187--20225
%U https://proceedings.mlr.press/v267/gorla25a.html
%V 267
%X We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features — captured by column names and text descriptions — to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods — an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, at random and completely at random, respectively) — across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
APA
Gorla, A., Wang, R., Liu, Z., An, U., & Sankararaman, S. (2025). CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:20187-20225. Available from https://proceedings.mlr.press/v267/gorla25a.html
