Naive imputation implicitly regularizes high-dimensional linear models

Alexis Ayme; Claire Boyer; Aymeric Dieuleveut; Erwan Scornet

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:1320-1340, 2023.

Abstract

Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d \gg \sqrt{n}$ regime. Experiments illustrate our findings.
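The link between zero imputation and ridge described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code. It assumes centered features with unit variance, covariance `Sigma`, and entries observed independently with the same probability `rho` (homogeneous MCAR); under those assumptions a direct second-moment calculation gives the population least-squares solution on zero-imputed data as a ridge solution with penalty `(1 - rho) / rho`, rescaled by `1 / rho`. All names, dimensions, the step size, and the low-dimensional setting are illustrative choices, and the averaged SGD routine is a generic single-pass Polyak-Ruppert variant rather than the exact procedure analysed in the paper.

```python
import numpy as np


def averaged_sgd(X, y, step_size):
    """Single pass of constant-step SGD on the squared loss.

    Returns the running average of the iterates (Polyak-Ruppert averaging).
    """
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for i in range(n):
        grad = (X[i] @ w - y[i]) * X[i]
        w = w - step_size * grad
        w_bar += (w - w_bar) / (i + 1)  # running mean of w_1, ..., w_{i+1}
    return w_bar


def zero_imputed_population_minimizer(Sigma, beta, rho):
    """Least-squares minimizer on zero-imputed data, in population.

    Assumes centered features and independent Bernoulli(rho) observation
    indicators: off-diagonal second moments are scaled by rho**2 and
    diagonal ones by rho.
    """
    D = np.diag(np.diag(Sigma))
    second_moment = rho**2 * Sigma + rho * (1 - rho) * D
    cross_moment = rho * (Sigma @ beta)
    return np.linalg.solve(second_moment, cross_moment)


def ridge_population_minimizer(Sigma, beta, lam):
    """Population ridge solution on complete data with penalty lam."""
    d = Sigma.shape[0]
    return np.linalg.solve(Sigma + lam * np.eye(d), Sigma @ beta)


# Illustrative setting (all values are assumptions made for this sketch).
d, n, rho = 5, 20000, 0.7
idx = np.arange(d)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # unit diagonal
beta = np.ones(d)

# Population identity: zero imputation behaves like a rescaled ridge.
w_imp = zero_imputed_population_minimizer(Sigma, beta, rho)
w_ridge = ridge_population_minimizer(Sigma, beta, (1 - rho) / rho)

# Simulate MCAR data, zero-impute, and run averaged SGD.
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T
y = X @ beta + 0.1 * rng.standard_normal(n)
mask = rng.random((n, d)) < rho      # True where the entry is observed
X_imputed = X * mask                 # missing entries become 0
w_hat = averaged_sgd(X_imputed, y, step_size=0.02)

print("zero-imputed minimizer:", w_imp)
print("rescaled ridge        :", w_ridge / rho)
print("averaged SGD estimate :", w_hat)
```

The first two printed vectors should agree up to numerical precision, and the averaged SGD estimate should land near them. The sketch uses a small `d` for readability, so it does not probe the high-dimensional regime the abstract refers to.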

Cite this Paper

BibTeX

@InProceedings{pmlr-v202-ayme23a,
  title = 	 {Naive imputation implicitly regularizes high-dimensional linear models},
  author =       {Ayme, Alexis and Boyer, Claire and Dieuleveut, Aymeric and Scornet, Erwan},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {1320--1340},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/ayme23a/ayme23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/ayme23a.html},
  abstract = 	 {Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d \gg \sqrt{n}$ regime. Experiments illustrate our findings.}
}

Endnote

%0 Conference Paper
%T Naive imputation implicitly regularizes high-dimensional linear models
%A Alexis Ayme
%A Claire Boyer
%A Aymeric Dieuleveut
%A Erwan Scornet
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-ayme23a
%I PMLR
%P 1320--1340
%U https://proceedings.mlr.press/v202/ayme23a.html
%V 202
%X Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d \gg \sqrt{n}$ regime. Experiments illustrate our findings.

APA

Ayme, A., Boyer, C., Dieuleveut, A. & Scornet, E. (2023). Naive imputation implicitly regularizes high-dimensional linear models. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:1320-1340. Available from https://proceedings.mlr.press/v202/ayme23a.html.
