NLP-Assisted Case Identification and Interpretable Machine Learning for Long COVID Detection in Primary Care EMRs
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:115-126, 2026.
Abstract
Identifying patients with Long COVID Syndrome (LCS) remains a challenge because of its varied symptoms, heterogeneous clinical presentation, and inconsistent documentation in electronic medical records (EMRs). In this study, we develop a machine learning framework that uses natural language processing (NLP) to identify confirmed cases of LCS from physician encounter notes and to predict which individuals are at risk. Using data from the Manitoba COVID-19 Cohort linked to the Manitoba Primary Care Research Network (MaPCReN), we construct a feature set that incorporates demographics, socioeconomic indicators, and pre- and post-COVID symptom profiles. We frame Long COVID identification as an NLP classification problem with extreme class imbalance (4% confirmed cases in the development cohort) and address this imbalance with random under-sampling and over-sampling strategies. Logistic regression with elastic net regularization combined with under-sampling achieves the best performance, with a sensitivity of 0.95, a specificity of 0.81, and an AUC of 0.94, and identifies 1,124 potential LCS cases among 4,556 COVID-19-positive individuals. These results demonstrate that combining unstructured clinical text with interpretable, imbalance-aware learning enables scalable Long COVID surveillance and risk identification in real-world EMR settings.
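The imbalance-handling step and the reported metrics can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_undersample` and `sensitivity_specificity` are hypothetical, the data are synthetic, and the 1:1 class ratio after under-sampling is an assumption, since the abstract does not state the ratio used.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    Assumption: a 1:1 target ratio; the source does not specify one.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row plus an equally sized random majority sample.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Synthetic toy cohort (not study data): 4 positives among 100 rows,
# mirroring the 4% prevalence quoted for the development cohort.
labels = [1] * 4 + [0] * 96
rows = [[float(i)] for i in range(100)]
bal_rows, bal_labels = random_undersample(rows, labels, seed=42)
```

After under-sampling, the balanced set would be passed to a classifier such as elastic-net logistic regression, and sensitivity and specificity would be computed on held-out data that keeps its original class ratio.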