Evaluation Protocols Under Extreme Class Imbalance: Evidence from a Newborn Screening Case Study

Nicole Sabourin, Paula Branco, Matthew P.A. Henderson
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:476-491, 2026.

Abstract

Evaluation protocols, such as cross-validation and bootstrap, are extensively used when experimenting with machine learning and AI models to obtain reliable performance estimates. However, the choice of specific configurations (e.g., 5-fold versus 10-fold cross-validation) or of hyper-parameter tuning strategies is often arbitrary, with researchers relying on frequently used defaults. There is limited knowledge about how these selections influence the reported performance, particularly in scenarios characterized by extreme class imbalance. In such challenging scenarios, researchers often apply resampling strategies, such as random oversampling or SMOTE, to improve performance on the rare class. However, the effects of these strategies on performance estimation under such extreme conditions also remain largely unexplored. This paper investigates the implications of multiple evaluation protocol choices in the context of extreme class imbalance, using a real-world case study in newborn screening to illustrate the practical impact on model assessment and reliability. Our findings show that some design choices critically influence the variability of the results, and that different configurations can affect the robustness of results, sometimes leading to conflicting conclusions about the best-performing model.
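One of the protocol choices the abstract refers to, combining k-fold cross-validation with resampling, can be illustrated with a small sketch. This is not the authors' protocol. It is a minimal standard-library illustration that assumes binary labels and substitutes plain random oversampling for SMOTE. The function names (`stratified_folds`, `random_oversample`, `cv_splits`) are hypothetical. The point it shows is that resampling is applied to the training folds only, so each test fold keeps the original class ratio.

```python
import random
from collections import Counter

def stratified_folds(labels, k, seed=0):
    """Assign indices to k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)  # round-robin within each class
    return folds

def random_oversample(train_idx, labels, seed=0):
    """Duplicate indices of smaller classes (with replacement) up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx in train_idx:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced

def cv_splits(labels, k, seed=0):
    """Yield (oversampled_train_indices, untouched_test_indices) for each fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # Resample the training portion only; the test fold is left as-is.
        yield random_oversample(train_idx, labels, seed + i), test_idx

if __name__ == "__main__":
    # Toy data (an assumption for illustration): 10 positives among 1000 cases.
    toy_labels = [1] * 10 + [0] * 990
    for train_idx, test_idx in cv_splits(toy_labels, k=5):
        train_counts = Counter(toy_labels[i] for i in train_idx)
        test_counts = Counter(toy_labels[i] for i in test_idx)
        print(dict(train_counts), dict(test_counts))
```

With only 10 positive cases, each of the 5 test folds contains just 2 positives, so a single misclassification changes the fold's recall by 50 percentage points. This is one way the number of folds can drive the variability of reported results under extreme imbalance.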

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-sabourin26a,
  title = {Evaluation Protocols Under Extreme Class Imbalance: Evidence from a Newborn Screening Case Study},
  author = {Sabourin, Nicole and Branco, Paula and Henderson, Matthew P.A.},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {476--491},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/sabourin26a/sabourin26a.pdf},
  url = {https://proceedings.mlr.press/v318/sabourin26a.html},
  abstract = {Evaluation protocols, such as cross-validation and bootstrap, are extensively used when experimenting with machine learning and AI models to obtain reliable performance estimates. However, the choice of specific configurations (e.g., 5-fold versus 10-fold cross-validation) or of hyper-parameter tuning strategies is often arbitrary, with researchers relying on frequently used defaults. There is limited knowledge about how these selections influence the reported performance, particularly in scenarios characterized by extreme class imbalance. In such challenging scenarios, researchers often apply resampling strategies, such as random oversampling or SMOTE, to improve performance on the rare class. However, the effects of these strategies on performance estimation under such extreme conditions also remain largely unexplored. This paper investigates the implications of multiple evaluation protocol choices in the context of extreme class imbalance, using a real-world case study in newborn screening to illustrate the practical impact on model assessment and reliability. Our findings show that some design choices critically influence the variability of the results, and that different configurations can affect the robustness of results, sometimes leading to conflicting conclusions about the best-performing model.}
}
Endnote
%0 Conference Paper
%T Evaluation Protocols Under Extreme Class Imbalance: Evidence from a Newborn Screening Case Study
%A Nicole Sabourin
%A Paula Branco
%A Matthew P.A. Henderson
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-sabourin26a
%I PMLR
%P 476--491
%U https://proceedings.mlr.press/v318/sabourin26a.html
%V 318
%X Evaluation protocols, such as cross-validation and bootstrap, are extensively used when experimenting with machine learning and AI models to obtain reliable performance estimates. However, the choice of specific configurations (e.g., 5-fold versus 10-fold cross-validation) or of hyper-parameter tuning strategies is often arbitrary, with researchers relying on frequently used defaults. There is limited knowledge about how these selections influence the reported performance, particularly in scenarios characterized by extreme class imbalance. In such challenging scenarios, researchers often apply resampling strategies, such as random oversampling or SMOTE, to improve performance on the rare class. However, the effects of these strategies on performance estimation under such extreme conditions also remain largely unexplored. This paper investigates the implications of multiple evaluation protocol choices in the context of extreme class imbalance, using a real-world case study in newborn screening to illustrate the practical impact on model assessment and reliability. Our findings show that some design choices critically influence the variability of the results, and that different configurations can affect the robustness of results, sometimes leading to conflicting conclusions about the best-performing model.
APA
Sabourin, N., Branco, P. & Henderson, M.P.A. (2026). Evaluation Protocols Under Extreme Class Imbalance: Evidence from a Newborn Screening Case Study. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:476-491. Available from https://proceedings.mlr.press/v318/sabourin26a.html.
