The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1592-1608, 2026.
Abstract
Large-scale critical care datasets have driven major progress in clinical AI, yet most remain limited to single institutions. The newly released CRITICAL dataset expands this scope, linking 1.95 billion records from 371,365 patients across four CTSA sites and capturing longitudinal patient journeys from pre-ICU to post-ICU care. Its scale and diversity enable more generalizable modeling but introduce significant challenges in data cleaning, vocabulary harmonization, and computational efficiency. We introduce CRISP (CRITICAL Records Integrated Standardization Pipeline), a scalable framework that transforms the raw CRITICAL resource into machine-learning–ready form. CRISP performs systematic data validation, cross-vocabulary mapping, and unit standardization while maintaining full auditability. Through parallelized optimization, it processes the entire dataset in under a day on standard computing hardware. The pipeline also provides reproducible baselines across multiple clinical prediction tasks, substantially reducing data preparation time and enabling consistent, multi-institutional evaluation. All code, documentation, and benchmarks are publicly available to support transparent and scalable clinical AI research.
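To make the pipeline steps concrete, the sketch below illustrates unit standardization with parallel record processing in the spirit the abstract describes. It is a minimal illustration, not CRISP's actual implementation: the concept names, the conversion table `TO_CANONICAL`, and the helper functions are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical conversion factors to a canonical unit per concept.
# CRISP's real mapping tables (OMOP vocabularies, UCUM units) are
# far larger; this table only illustrates the lookup pattern.
TO_CANONICAL = {
    ("glucose", "mg/dL"): ("mmol/L", 0.0555),
    ("glucose", "mmol/L"): ("mmol/L", 1.0),
}

def standardize(record):
    """Convert one (concept, unit, value) record to its canonical unit."""
    concept, unit, value = record
    canonical_unit, factor = TO_CANONICAL[(concept, unit)]
    return (concept, canonical_unit, round(value * factor, 3))

def standardize_all(records, workers=4):
    """Standardize records concurrently, mirroring the pipeline's
    parallelized processing of large record batches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(standardize, records))
```

For example, `standardize_all([("glucose", "mg/dL", 100.0)])` yields the record expressed in mmol/L; a production pipeline would additionally log each conversion for auditability, as the abstract emphasizes.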