NLP-Assisted Case Identification and Interpretable Machine Learning for Long COVID Detection in Primary Care EMRs

Surani Matharaarachchi, Alan Katz, Mike Domaratzki, Saman Muthukumarana
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:115-126, 2026.

Abstract

Identifying patients with Long COVID Syndrome (LCS) remains a challenge due to various symptoms, heterogeneous clinical presentation, and inconsistent documentation in electronic medical records. In this study, we develop a machine learning framework that uses natural language processing (NLP) to identify confirmed cases of LCS from physician encounter notes and to predict individuals at risk. Using data from the Manitoba COVID-19 Cohort linked to the Manitoba Primary Care Research Network (MaPCReN), we construct a set of characteristics that incorporate demographics, socioeconomic indicators, and pre- and post-COVID symptom profiles. We frame Long COVID identification as an extreme class-imbalance NLP classification problem (4% confirmed cases in the development cohort) and address this challenge using imbalance-aware learning through random under-sampling and over-sampling strategies. Logistic regression with elastic net regularization combined with under-sampling achieves the best performance, with a sensitivity of 0.95, specificity of 0.81, and an AUC of 0.94, identifying 1,124 potential LCS cases among 4,556 COVID-19 positive individuals. These results demonstrate that combining unstructured clinical text with interpretable, imbalance-aware learning enables scalable Long COVID surveillance and risk identification in real-world EMR settings.
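
As a rough illustration of two ideas named in the abstract, random under-sampling and the sensitivity/specificity metrics, the following minimal Python sketch may help. It is not the authors' implementation: the function names `random_undersample` and `sensitivity_specificity`, the toy data (4 positives among 100 records), and the seed are all hypothetical, and the classifier itself (elastic net logistic regression) is deliberately left out.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Hypothetical helper for illustration; assumes binary labels 0/1.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority index plus an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary 0/1 labels.

    Assumes y_true contains at least one positive and one negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy cohort (made up): 4 confirmed cases among 100 records.
labels = [1] * 4 + [0] * 96
rows = [[float(i)] for i in range(100)]

bal_rows, bal_labels = random_undersample(rows, labels, seed=42)
print(len(bal_labels), sum(bal_labels))  # 8 records, 4 of them positive

# One of two positives detected, both negatives correctly rejected.
print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))  # (0.5, 1.0)
```

Under-sampling would typically be applied to the training split only, so that sensitivity and specificity are still measured on data with the original class balance.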

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-matharaarachchi26a,
  title = {NLP-Assisted Case Identification and Interpretable Machine Learning for Long COVID Detection in Primary Care EMRs},
  author = {Matharaarachchi, Surani and Katz, Alan and Domaratzki, Mike and Muthukumarana, Saman},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {115--126},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/matharaarachchi26a/matharaarachchi26a.pdf},
  url = {https://proceedings.mlr.press/v318/matharaarachchi26a.html},
  abstract = {Identifying patients with Long COVID Syndrome (LCS) remains a challenge due to various symptoms, heterogeneous clinical presentation, and inconsistent documentation in electronic medical records. In this study, we develop a machine learning framework that uses natural language processing (NLP) to identify confirmed cases of LCS from physician encounter notes and to predict individuals at risk. Using data from the Manitoba COVID-19 Cohort linked to the Manitoba Primary Care Research Network (MaPCReN), we construct a set of characteristics that incorporate demographics, socioeconomic indicators, and pre- and post-COVID symptom profiles. We frame Long COVID identification as an extreme class-imbalance NLP classification problem (4% confirmed cases in the development cohort) and address this challenge using imbalance-aware learning through random under-sampling and over-sampling strategies. Logistic regression with elastic net regularization combined with under-sampling achieves the best performance, with a sensitivity of 0.95, specificity of 0.81, and an AUC of 0.94, identifying 1,124 potential LCS cases among 4,556 COVID-19 positive individuals. These results demonstrate that combining unstructured clinical text with interpretable, imbalance-aware learning enables scalable Long COVID surveillance and risk identification in real-world EMR settings.}
}
Endnote
%0 Conference Paper
%T NLP-Assisted Case Identification and Interpretable Machine Learning for Long COVID Detection in Primary Care EMRs
%A Surani Matharaarachchi
%A Alan Katz
%A Mike Domaratzki
%A Saman Muthukumarana
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-matharaarachchi26a
%I PMLR
%P 115--126
%U https://proceedings.mlr.press/v318/matharaarachchi26a.html
%V 318
%X Identifying patients with Long COVID Syndrome (LCS) remains a challenge due to various symptoms, heterogeneous clinical presentation, and inconsistent documentation in electronic medical records. In this study, we develop a machine learning framework that uses natural language processing (NLP) to identify confirmed cases of LCS from physician encounter notes and to predict individuals at risk. Using data from the Manitoba COVID-19 Cohort linked to the Manitoba Primary Care Research Network (MaPCReN), we construct a set of characteristics that incorporate demographics, socioeconomic indicators, and pre- and post-COVID symptom profiles. We frame Long COVID identification as an extreme class-imbalance NLP classification problem (4% confirmed cases in the development cohort) and address this challenge using imbalance-aware learning through random under-sampling and over-sampling strategies. Logistic regression with elastic net regularization combined with under-sampling achieves the best performance, with a sensitivity of 0.95, specificity of 0.81, and an AUC of 0.94, identifying 1,124 potential LCS cases among 4,556 COVID-19 positive individuals. These results demonstrate that combining unstructured clinical text with interpretable, imbalance-aware learning enables scalable Long COVID surveillance and risk identification in real-world EMR settings.
APA
Matharaarachchi, S., Katz, A., Domaratzki, M. & Muthukumarana, S. (2026). NLP-Assisted Case Identification and Interpretable Machine Learning for Long COVID Detection in Primary Care EMRs. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:115-126. Available from https://proceedings.mlr.press/v318/matharaarachchi26a.html.