Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels

Danial Dervovic, Michael Cashmore
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:1909-1917, 2025.

Abstract

Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. Ignoring samples with missing values, a common solution, can introduce bias, especially when data is Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method not only offers point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution’s location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Additionally, a robustness proof is presented, confirming the validity of the approximation under a realistic error model.
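
To make the idea concrete, the sketch below is a minimal illustration (not the authors' implementation) of multiply imputing missing test labels from a probabilistic classifier's own scores and recording the resulting spread of an evaluation metric such as ROC-AUC. The function name `metric_distribution` and the toy arrays `y` and `p` are hypothetical, introduced only for illustration.

```python
# Minimal sketch: multiple imputation of missing test labels, giving a
# predictive distribution over a classification metric (here ROC-AUC).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def metric_distribution(y_obs, scores, metric=roc_auc_score, n_imputations=200):
    """y_obs: labels with np.nan marking missing entries.
    scores: predicted probabilities P(y=1 | x) for every test point (assumed).
    Returns one metric value per imputed completion of the labels."""
    y_obs = np.asarray(y_obs, dtype=float)
    scores = np.asarray(scores, dtype=float)
    missing = np.isnan(y_obs)
    draws = np.empty(n_imputations)
    for m in range(n_imputations):
        y_imp = y_obs.copy()
        # Impute each missing label by sampling from the predicted probability.
        y_imp[missing] = rng.binomial(1, scores[missing])
        draws[m] = metric(y_imp.astype(int), scores)
    return draws

# Usage: summarise location and spread of the metric under imputation.
y = np.array([1, 0, np.nan, 1, np.nan, 0, 1, 0, np.nan, 1], dtype=float)
p = np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3, 0.55, 0.95])
auc_draws = metric_distribution(y, p)
print(auc_draws.mean(), auc_draws.std())
```

The mean of the draws plays the role of a point estimate, while their spread reflects the uncertainty induced by the missing labels; the paper's results concern when and how well such a distribution is approximately Gaussian.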

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-dervovic25a,
  title     = {Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels},
  author    = {Dervovic, Danial and Cashmore, Michael},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {1909--1917},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/dervovic25a/dervovic25a.pdf},
  url       = {https://proceedings.mlr.press/v258/dervovic25a.html},
  abstract  = {Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. Ignoring samples with missing values, a common solution, can introduce bias, especially when data is Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method not only offers point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution's location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Additionally, a robustness proof is presented, confirming the validity of the approximation under a realistic error model.}
}
Endnote
%0 Conference Paper
%T Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels
%A Danial Dervovic
%A Michael Cashmore
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-dervovic25a
%I PMLR
%P 1909--1917
%U https://proceedings.mlr.press/v258/dervovic25a.html
%V 258
%X Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. Ignoring samples with missing values, a common solution, can introduce bias, especially when data is Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method not only offers point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution's location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Additionally, a robustness proof is presented, confirming the validity of the approximation under a realistic error model.
APA
Dervovic, D. & Cashmore, M. (2025). Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:1909-1917. Available from https://proceedings.mlr.press/v258/dervovic25a.html.
