LLMSYN: Generating Synthetic Electronic Health Records Without Patient-Level Data
Proceedings of the 9th Machine Learning for Healthcare Conference, PMLR 252, 2024.
Abstract
Recent advancements in large language models (LLMs) have shown promise in tasks like question answering, text summarization, and code generation. However, their effectiveness within the healthcare sector remains uncertain. This study investigates LLMs’ potential in generating synthetic Electronic Health Records (EHRs) by assessing their ability to produce structured data. Unfortunately, our preliminary results indicate that employing LLMs directly resulted in poor statistical similarity and utility. Feeding real-world datasets to LLMs could mitigate this issue, but doing so raises privacy concerns, since patients’ information would be uploaded to the LLM API. To address these challenges and unleash the potential of LLMs in health data science, we present a new generation pipeline called LLMSYN. This pipeline utilizes only high-level statistical information from datasets and publicly available medical knowledge. The results demonstrate that the EHRs generated by LLMSYN exhibit improved statistical similarity and utility in downstream tasks, achieving predictive performance comparable to training with real data, while presenting minimal privacy risks. Our findings suggest that LLMSYN offers a promising approach to enhancing the utility of LLMs in synthetic structured EHR generation.