Evaluation Protocols Under Extreme Class Imbalance: Evidence from a Newborn Screening Case Study
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:476-491, 2026.
Abstract
Evaluation protocols, such as cross-validation and bootstrap, are extensively used when experimenting with machine learning and AI models to obtain reliable performance estimates. However, the choice of specific configurations (e.g., 5-fold versus 10-fold cross-validation) or of hyper-parameter tuning strategies is often arbitrary, with researchers relying on frequently used defaults. There is limited knowledge about how these selections influence the reported performance, particularly in scenarios characterized by extreme class imbalance. In such challenging scenarios, researchers often apply resampling strategies, such as random oversampling or SMOTE, to improve performance on the rare class. However, the effects of these strategies on performance estimation under such extreme conditions also remain largely unexplored. This paper investigates the implications of multiple evaluation protocol choices in the context of extreme class imbalance, using a real-world case study in newborn screening to illustrate the practical impact on model assessment and reliability. Our findings show that some design choices critically influence the variability of the results, and that different configurations can affect the robustness of the results, sometimes leading to conflicting conclusions about the best-performing model.
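To make the fold-count question concrete, the following is a minimal sketch, not the protocol used in the paper. It assumes a hypothetical dataset of 10,000 samples with 12 positives, and the helper names `stratified_folds` and `random_oversample` are illustrative. It shows how many rare-class samples land in each test fold under 5-fold versus 10-fold stratified splitting, and how random oversampling can be restricted to the training portion so that the test fold keeps its original class distribution.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal indices round-robin so each fold gets a near-equal share.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def random_oversample(train_idx, labels, seed=0):
    """Duplicate minority-class indices at random until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels[i] for i in train_idx)
    target = max(counts.values())
    out = list(train_idx)
    for cls, n in counts.items():
        pool = [i for i in train_idx if labels[i] == cls]
        out.extend(rng.choice(pool) for _ in range(target - n))
    return out


# Hypothetical data (an assumption, not from the paper): 12 positives in 10,000.
labels = [1] * 12 + [0] * 9988

for k in (5, 10):
    folds = stratified_folds(labels, k)
    positives_per_test_fold = [sum(labels[i] for i in f) for f in folds]
    print(k, positives_per_test_fold)

# Oversample the training portion only; the test fold stays untouched.
folds = stratified_folds(labels, 5)
test_idx = folds[0]
train_idx = [i for f in folds[1:] for i in f]
balanced_train = random_oversample(train_idx, labels)
balanced_counts = Counter(labels[i] for i in balanced_train)
```

With only 12 positives, each test fold holds 2 or 3 positives under 5-fold splitting and 1 or 2 under 10-fold splitting, so a single misclassified positive shifts per-fold recall by a large step. This is one plausible reason why the configuration choice can matter for the variability of the reported estimates.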