Learning Under Extreme Label Imbalance in EHRs: A Dependency-Aware Loss for Multi-Label Classification
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:880-904, 2026.
Abstract
Extreme multi-label next-visit diagnosis forecasting from electronic health records (EHRs) is dominated by label sparsity. Each visit contains only a handful of positive ICD-10 codes among thousands of candidates, yet codes are strongly correlated through comorbidity structure. In this regime, standard element-wise objectives (such as focal and class-balanced losses) often maximize sensitivity at the cost of severe precision degradation, producing clinically impractical alert volumes. We propose an architecture-compatible dependency-aware ranking loss that (i) reweights per-code correctness under severe imbalance, (ii) aggregates errors with rank-based emphasis on the hardest labels, and (iii) regularizes predictions with a learned pairwise dependency term in the output space. Using an EHR Transformer backbone, we evaluate on the CPRD cohort ($V{=}1{,}538$ codes), benchmarking loss functions on 200,000 patients and validating scalability up to 3.2 million. The proposed objective shifts the precision–recall trade-off toward fewer false positives while maintaining competitive sensitivity, and preserves overall ranking quality (PRC–AUC comparable to weighted BCE). In addition, it yields an auditable population-level dependency matrix summarizing learned co-occurrence structure. These results suggest that explicit output-space structure can improve the precision–recall trade-off in sparse, high-dimensional next-visit diagnosis prediction from EHRs.
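The abstract names three components of the loss but does not give their formulas. As a rough illustration only, the NumPy sketch below shows one way such components could be combined. Everything in it is an assumption rather than the paper's formulation: the function name `dependency_aware_loss_sketch`, the parameters `pos_weight`, `top_k` and `lam`, the top-k mean used for rank-based aggregation, and the quadratic pairwise penalty. The dependency matrix `dep` is passed in as a fixed array here, whereas the paper describes it as learned.

```python
import numpy as np

def dependency_aware_loss_sketch(logits, targets, pos_weight, dep, top_k=5, lam=0.1):
    """Illustrative sketch; NOT the paper's loss.

    logits, targets: arrays of shape (batch, V)
    pos_weight:      array of shape (V,), hypothetical per-code positive weights
    dep:             array of shape (V, V), hypothetical pairwise dependency weights
    """
    eps = 1e-7
    probs = 1.0 / (1.0 + np.exp(-logits))

    # (i) Per-code reweighted binary cross-entropy (assumed form).
    per_label = -(pos_weight * targets * np.log(probs + eps)
                  + (1.0 - targets) * np.log(1.0 - probs + eps))

    # (ii) Rank-based emphasis (assumed form): average only the top_k
    # largest per-label losses within each sample.
    hardest = np.sort(per_label, axis=1)[:, -top_k:]
    rank_term = hardest.mean()

    # (iii) Pairwise dependency regularizer (assumed form): penalize
    # disagreement between predictions for codes that dep links strongly.
    # Note: this builds a (batch, V, V) array, so it only suits small V.
    diff = probs[:, :, None] - probs[:, None, :]
    dep_term = (dep * diff ** 2).mean()

    return rank_term + lam * dep_term
```

A call on a toy batch might look like `dependency_aware_loss_sketch(logits, targets, np.ones(V), np.zeros((V, V)), top_k=2)`. With `dep` set to zeros the third term vanishes, which makes it easy to inspect the contribution of the reweighting and ranking terms in isolation.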