The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data

Xiaolong Luo, Michael Lingzhi Li
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1592-1608, 2026.

Abstract

Large-scale critical care datasets have driven major progress in clinical AI, yet most remain limited to single institutions. The newly released CRITICAL dataset expands this scope, linking 1.95 billion records from 371,365 patients across four CTSA sites and capturing longitudinal patient journeys from pre-ICU to post-ICU care. Its scale and diversity enable more generalizable modeling but introduce significant challenges in data cleaning, vocabulary harmonization, and computational efficiency. We introduce CRISP (CRITICAL Records Integrated Standardization Pipeline), a scalable framework that transforms the raw CRITICAL resource into machine-learning-ready form. CRISP performs systematic data validation, cross-vocabulary mapping, and unit standardization while maintaining full auditability. Through parallelized optimization, it processes the entire dataset in under a day on standard computing hardware. The pipeline also provides reproducible baselines across multiple clinical prediction tasks, substantially reducing data preparation time and enabling consistent, multi-institutional evaluation. All code, documentation, and benchmarks are publicly available to support transparent and scalable clinical AI research.
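The abstract mentions unit standardization as one of CRISP's processing steps. As a purely illustrative aside (the paper's actual implementation is not shown here), a minimal sketch of this kind of step might look like the following; the table `UNIT_CONVERSIONS` and the function `standardize_value` are hypothetical names, and the conversion factors shown are standard clinical-chemistry constants, not values taken from the paper.

```python
# Hypothetical sketch of a unit-standardization step, in the spirit of
# what a pipeline like CRISP might do. Names and structure are assumptions.

# (source_unit, target_unit) -> multiplicative conversion factor
UNIT_CONVERSIONS = {
    ("mg/dL", "mmol/L"): 0.0555,  # e.g. glucose: mg/dL -> mmol/L
    ("g/dL", "g/L"): 10.0,        # e.g. hemoglobin: g/dL -> g/L
}

def standardize_value(value: float, source_unit: str, target_unit: str) -> float:
    """Convert a measurement to the target unit, or fail loudly if no
    mapping exists (unmapped records can then be logged for auditability)."""
    if source_unit == target_unit:
        return value
    try:
        factor = UNIT_CONVERSIONS[(source_unit, target_unit)]
    except KeyError:
        raise ValueError(f"No conversion from {source_unit!r} to {target_unit!r}")
    return value * factor

print(standardize_value(90.0, "mg/dL", "mmol/L"))  # glucose 90 mg/dL ≈ 5.0 mmol/L
```

Failing loudly on unmapped unit pairs, rather than silently passing values through, is one simple way to keep such a step auditable.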

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-luo26a,
  title     = {The {CRITICAL} Records Integrated Standardization Pipeline ({CRISP}): End-to-End Processing of Large-scale Multi-institutional {OMOP} {CDM} Data},
  author    = {Luo, Xiaolong and Li, Michael Lingzhi},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {1592--1608},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/luo26a/luo26a.pdf},
  url       = {https://proceedings.mlr.press/v297/luo26a.html},
  abstract  = {Large-scale critical care datasets have driven major progress in clinical {AI}, yet most remain limited to single institutions. The newly released {CRITICAL} dataset expands this scope, linking 1.95 billion records from 371,365 patients across four {CTSA} sites and capturing longitudinal patient journeys from pre-{ICU} to post-{ICU} care. Its scale and diversity enable more generalizable modeling but introduce significant challenges in data cleaning, vocabulary harmonization, and computational efficiency. We introduce {CRISP} ({CRITICAL} Records Integrated Standardization Pipeline), a scalable framework that transforms the raw {CRITICAL} resource into machine-learning-ready form. {CRISP} performs systematic data validation, cross-vocabulary mapping, and unit standardization while maintaining full auditability. Through parallelized optimization, it processes the entire dataset in under a day on standard computing hardware. The pipeline also provides reproducible baselines across multiple clinical prediction tasks, substantially reducing data preparation time and enabling consistent, multi-institutional evaluation. All code, documentation, and benchmarks are publicly available to support transparent and scalable clinical {AI} research.}
}
Endnote
%0 Conference Paper
%T The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data
%A Xiaolong Luo
%A Michael Lingzhi Li
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-luo26a
%I PMLR
%P 1592--1608
%U https://proceedings.mlr.press/v297/luo26a.html
%V 297
%X Large-scale critical care datasets have driven major progress in clinical AI, yet most remain limited to single institutions. The newly released CRITICAL dataset expands this scope, linking 1.95 billion records from 371,365 patients across four CTSA sites and capturing longitudinal patient journeys from pre-ICU to post-ICU care. Its scale and diversity enable more generalizable modeling but introduce significant challenges in data cleaning, vocabulary harmonization, and computational efficiency. We introduce CRISP (CRITICAL Records Integrated Standardization Pipeline), a scalable framework that transforms the raw CRITICAL resource into machine-learning-ready form. CRISP performs systematic data validation, cross-vocabulary mapping, and unit standardization while maintaining full auditability. Through parallelized optimization, it processes the entire dataset in under a day on standard computing hardware. The pipeline also provides reproducible baselines across multiple clinical prediction tasks, substantially reducing data preparation time and enabling consistent, multi-institutional evaluation. All code, documentation, and benchmarks are publicly available to support transparent and scalable clinical AI research.
APA
Luo, X. & Li, M. L. (2026). The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1592-1608. Available from https://proceedings.mlr.press/v297/luo26a.html.