Manifold-Informed Cohort Discovery (MICD): A Framework for Uncovering Latent Risk Signals in Imbalanced Healthcare Data
Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:94-101, 2026.
Abstract
Risk stratification for Coronary Heart Disease (CHD) is fundamentally challenged by severe class imbalance and the structural heterogeneity of the non-diseased patient cohort. Standard classification models treat all CHD-negative patients uniformly and therefore fail to detect critical, latent high-risk sub-groups. We introduce the Manifold-Informed Cohort Discovery (MICD) Framework, a novel methodology that systematically integrates clinically informed feature selection, manifold learning (UMAP), and proximity-based clustering to extract these latent risk signals. Our core insight is that individuals with latent high-risk profiles lie in close geometric proximity to true CHD-positive cases within the UMAP-embedded feature space. We validate the framework’s clinical relevance by autonomously isolating a high-risk negative cohort whose feature profile strongly aligns with the established diagnostic markers of Metabolic Syndrome. This alignment indicates that our abstract geometric approach encodes a biologically and clinically meaningful pre-disease state. When the insights from this cohort discovery are used in a downstream classification task, the MICD-enhanced model achieves the strongest predictive performance among the compared methods (AUROC $\approx$ 85.1%), significantly outperforming the clinical gold standard (ASCVD Risk Calculator) and state-of-the-art imbalanced learning methods (Focal Loss, SMOTE). Our work establishes a critical, interpretable link between unsupervised data structure and actionable supervised clinical prediction, providing a powerful tool for early, preventative intervention.
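The proximity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify the clustering step, so the scoring rule here (mean distance to the `k` nearest positive cases), the function name `flag_latent_high_risk`, and the parameters `k` and `fraction` are all assumptions made for illustration. In practice the coordinates would come from a UMAP embedding of the selected features; the toy coordinates below are hand-made.

```python
import math


def flag_latent_high_risk(embedding, labels, k=3, fraction=0.25):
    """Return indices of the negative points lying closest to positive cases.

    embedding: list of coordinate tuples (e.g. rows of a 2-D UMAP output)
    labels:    list of 0/1 outcomes (1 = CHD-positive)
    k:         nearest positive neighbours to average over (assumed value)
    fraction:  share of negatives to flag, closest first (assumed value)
    """
    positives = [p for p, y in zip(embedding, labels) if y == 1]
    if not positives:
        raise ValueError("need at least one positive case")

    scores = []
    for i, (point, y) in enumerate(zip(embedding, labels)):
        if y != 0:
            continue
        # Mean distance from this negative point to its k nearest positives.
        nearest = sorted(math.dist(point, pos) for pos in positives)[:k]
        scores.append((sum(nearest) / len(nearest), i))

    # Smallest score first: these negatives sit nearest the positive region.
    scores.sort()
    n_flag = math.ceil(fraction * len(scores))
    return [i for _, i in scores[:n_flag]]


# Toy embedding: positives cluster near (5, 5); two negatives sit beside them.
embedding = [
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),    # positives
    (5.1, 5.3), (4.7, 4.8),                # negatives near the positives
    (0.0, 0.0), (0.3, 0.1), (-0.2, 0.4),   # negatives far from the positives
    (0.1, -0.3), (9.0, 0.0), (0.5, 0.5),
]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

flagged = flag_latent_high_risk(embedding, labels, k=3, fraction=0.25)
print(flagged)
```

With this toy data the two negatives placed next to the positive cluster are the ones flagged. A fixed fraction is only one possible cut-off; a distance threshold or a density-based clustering step would serve the same illustrative purpose.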