Error Profiling of Machine Learning Models: An Exploratory Visualization
Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025.
Abstract
While data-driven predictive models are increasingly used in healthcare, their clinical translation remains limited—partly due to challenges in evaluating model performance across design choices. Existing explainability methods often focus on intra-model interpretability but fall short in supporting inter-model comparisons. We present a visualization-based error profiling method that facilitates comparative evaluation by highlighting overlaps and differences in model predictions. Our matrix-based visualization maps which models incorrectly classify which patient subgroups, with color intensity indicating the number of misclassified patients. This approach enables deeper insight into which (sub)populations are consistently (in)correctly classified across models, helping uncover patterns of model (dis)agreement and assess the impact of modeling decisions. We demonstrate our visualization method in four healthcare use cases: 1) missing data imputation in a longitudinal nutritional dataset; 2) feature set analysis using randomized controlled trial data; 3) end-model technical performance in cardiac morbidity prediction; and 4) data modality comparison using a dual-source lung cancer dataset with longitudinal and radiomic features. To evaluate the visualization, we obtained expert feedback and qualitative assessments of decision-making insights. Survey results—across clinicians, computer scientists, and medical informaticians—indicated that our method provides an interpretable and intuitive way to compare model error distributions by highlighting patterns within correctly and incorrectly classified subpopulations across different models. Our comprehensible error profiling approach represents an initial step toward a systematic framework for improving model assessment in clinical tasks. Through this framework, both model developers and end users can better understand when and where a given model is appropriate for real-world clinical deployment.
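The core quantity behind the matrix visualization described above can be sketched in a few lines: for each model and each patient subgroup, count how many patients that model misclassifies, yielding a models × subgroups matrix whose entries would drive the color intensity. The models, labels, and subgroup names below are purely illustrative assumptions, not data or code from the paper.

```python
import numpy as np

# Hypothetical example: three models' binary predictions on eight patients
# split into two subgroups "A" and "B". All values are illustrative.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
subgroup = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
preds = {
    "logreg": np.array([1, 0, 0, 1, 0, 1, 1, 0]),
    "rf":     np.array([1, 1, 1, 1, 0, 0, 0, 0]),
    "mlp":    np.array([0, 0, 1, 1, 1, 0, 1, 1]),
}

groups = sorted(set(subgroup))
# Rows = models, columns = subgroups; each entry counts misclassified
# patients -- the quantity the visualization encodes as color intensity.
error_matrix = np.array([
    [int(((p != y_true) & (subgroup == g)).sum()) for g in groups]
    for p in preds.values()
])
for name, row in zip(preds, error_matrix):
    print(name, dict(zip(groups, row.tolist())))
```

Rendering this matrix as a heatmap (e.g., with a sequential colormap) then makes inter-model agreement visible at a glance: columns with uniformly dark cells mark subgroups that all models struggle with, while rows that differ reveal where modeling choices change the error profile.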