[edit]
ScoEHR: Generating Synthetic Electronic Health Records using Continuous-time Diffusion Models
Proceedings of the 8th Machine Learning for Healthcare Conference, PMLR 219:489-508, 2023.
Abstract
Global access to statistically and clinically representative patient health data holds potential for advancing disease research, enhancing patient care, and accelerating drug development. However, acquisition of health data such as electronic health records (EHRs) comes with challenges characterised by high costs, time constraints, and concerns related to patient privacy. An approach to tackling these challenges is by using synthetic data. In this paper we introduce ScoEHR, a novel deep learning method for generating synthetic EHRs, which combines an autoencoder with a continuous-time diffusion model. ScoEHR is shown to outperform three baseline synthetic EHR generation frameworks (medGAN, medWGAN, and medBGAN) on two publicly available datasets, MIMIC-III and the Yale New Haven Health System Emergency Department dataset, based on four widely accepted metrics of data utility. Additionally, a blind clinician evaluation was carried out to assess the qualitative realism of the synthetic data generated by ScoEHR. In this evaluation, a patient’s data was labeled as ‘unrealistic’ if at least one clinician found it to be unrealistic. This evaluation showed that existing real EHR data and ScoEHR generated synthetic data were scored as equally realistic. Our code is available at https://github.com/aanaseer/ ScoEHR.