A primer on linear classification with missing data

Angel David Reyero Lobo, Alexis Ayme, Claire Boyer, Erwan Scornet
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:1225-1233, 2025.

Abstract

Supervised learning with missing data aims to build the best possible prediction of a target output from partially observed inputs. The major approaches to this problem fall into $(i)$ impute-then-predict strategies, which first fill in the empty input components and then apply a single predictor, and $(ii)$ Pattern-by-Pattern (P-b-P) approaches, where a separate predictor is built for each missing pattern. In this paper, we theoretically analyze how three classical linear classifiers, namely the perceptron, logistic regression and linear discriminant analysis (LDA), behave with Missing Completely At Random (MCAR) data, depending on the strategy (imputation or P-b-P) used to handle missing values. We prove that both imputation and P-b-P approaches are ill-specified in a logistic regression framework, thus questioning the relevance of such approaches for handling missing data. The most favorable setting for classification with missing data turns out to be that of P-b-P LDA methods. We provide finite-sample bounds on the excess risk in this framework, even in high-dimensional settings or with MNAR data. Experiments illustrate our theoretical findings.
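
The following is a minimal, self-contained sketch, not the authors' implementation, contrasting the two strategies described above on toy MCAR data with scikit-learn. The Gaussian data, mean imputation, and per-pattern LDA choices are illustrative assumptions, not the paper's exact experimental setup.

# Illustrative sketch only: impute-then-predict vs. Pattern-by-Pattern (P-b-P)
# on synthetic data with MCAR missingness. Choices below (mean imputation,
# per-pattern LDA, toy Gaussian classes) are assumptions for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy data: two Gaussian classes with shifted means, then entries erased
# completely at random (MCAR) with probability 0.3.
n, d = 1000, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + y[:, None]
X[rng.random((n, d)) < 0.3] = np.nan

# (i) Impute-then-predict: fill in the holes, then fit a single linear classifier.
impute_then_predict = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=1000),
).fit(X, y)

# (ii) Pattern-by-Pattern: one classifier per missing pattern, fitted on the
# observed coordinates of the samples sharing that pattern (here LDA).
def fit_pattern_by_pattern(X, y):
    models = {}
    patterns = np.isnan(X)
    for pat in np.unique(patterns, axis=0):
        rows = (patterns == pat).all(axis=1)
        obs = ~pat
        # Skip patterns with no observed coordinate or a single class present.
        if obs.any() and len(np.unique(y[rows])) == 2:
            models[tuple(pat)] = LinearDiscriminantAnalysis().fit(
                X[np.ix_(rows, obs)], y[rows]
            )
    return models

def predict_pattern_by_pattern(models, X, default=0):
    y_hat = np.full(len(X), default)
    patterns = np.isnan(X)
    for pat, model in models.items():
        pat = np.asarray(pat)
        rows = (patterns == pat).all(axis=1)
        if rows.any():
            y_hat[rows] = model.predict(X[np.ix_(rows, ~pat)])
    return y_hat

pbp_models = fit_pattern_by_pattern(X, y)
print("impute-then-predict (in-sample):", impute_then_predict.score(X, y))
print("pattern-by-pattern  (in-sample):", (predict_pattern_by_pattern(pbp_models, X) == y).mean())

Note the design trade-off the sketch makes visible: the impute-then-predict pipeline learns one set of coefficients for all observations, whereas the P-b-P wrapper fits a separate LDA per missing pattern and therefore needs enough samples of each pattern to estimate its own classifier.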

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-lobo25a,
  title     = {A primer on linear classification with missing data},
  author    = {Reyero Lobo, Angel David and Ayme, Alexis and Boyer, Claire and Scornet, Erwan},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {1225--1233},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/lobo25a/lobo25a.pdf},
  url       = {https://proceedings.mlr.press/v258/lobo25a.html}
}
Endnote
%0 Conference Paper
%T A primer on linear classification with missing data
%A Angel David Reyero Lobo
%A Alexis Ayme
%A Claire Boyer
%A Erwan Scornet
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-lobo25a
%I PMLR
%P 1225--1233
%U https://proceedings.mlr.press/v258/lobo25a.html
%V 258
APA
Reyero Lobo, A.D., Ayme, A., Boyer, C. & Scornet, E. (2025). A primer on linear classification with missing data. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:1225-1233. Available from https://proceedings.mlr.press/v258/lobo25a.html.