Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data
Proceedings of The 14th Asian Conference on Machine
Learning, PMLR 189:265-279, 2023.
Abstract
Missing data are ubiquitous in real-world
applications and, if not adequately handled, may
lead to loss of information and biased findings
in downstream analysis. In particular,
high-dimensional incomplete data with a moderate
sample size, such as those arising in multi-omics
analyses, present daunting challenges. Imputation is arguably
the most popular method for handling missing data,
though existing imputation methods have a number of
limitations. Single imputation methods, such as
matrix completion, do not adequately account
for imputation uncertainty and hence yield
improper statistical inference. In contrast,
multiple imputation (MI) methods allow for proper
inference but existing methods do not perform well
in high-dimensional settings. Our work aims to
address these significant methodological gaps,
leveraging recent advances in neural network
Gaussian process (NNGP) from a Bayesian
viewpoint. We propose two NNGP-based MI methods,
collectively called MI-NNGP, that draw multiple imputations
for missing values from a joint (posterior
predictive) distribution. The MI-NNGP methods are
shown to significantly outperform existing
state-of-the-art methods on synthetic and real
datasets, in terms of imputation error, statistical
inference, robustness to missing rates, and
computational cost, under three missing-data
mechanisms: missing completely at random (MCAR),
missing at random (MAR), and missing not at random (MNAR).
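
As a minimal sketch of why multiple imputation can support valid inference where single imputation cannot, the following pools a scalar estimate across M imputed datasets using Rubin's rules. This is the standard MI pooling step, not the MI-NNGP method itself; the function name `pool_rubin` and the example numbers are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across M imputed datasets (Rubin's rules).

    estimates: the M point estimates, one per imputed dataset.
    variances: the M squared standard errors of those estimates.
    Returns (pooled_estimate, total_variance).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2 or u.size != m:
        raise ValueError("need M >= 2 estimates with matching variances")

    q_bar = q.mean()           # pooled point estimate
    within = u.mean()          # average within-imputation variance
    between = q.var(ddof=1)    # between-imputation variance
    # The between term carries the imputation uncertainty that a single
    # imputation has no way to estimate.
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


# Hypothetical numbers: three imputed datasets, same within-variance.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

With a single imputed dataset only the within-imputation variance is available, so the reported uncertainty is too small whenever the imputations would have differed.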