LLMSYN: Generating Synthetic Electronic Health Records Without Patient-Level Data
Proceedings of the 9th Machine Learning for Healthcare Conference, PMLR 252, 2024.
Abstract
Recent advancements in large language models (LLMs) have shown promise in tasks like question answering, text summarization, and code generation. However, their effectiveness within the healthcare sector remains uncertain. This study investigates LLMs’ potential in generating synthetic Electronic Health Records (EHRs) by assessing their ability to produce structured data. Unfortunately, our preliminary results indicate that employing LLMs directly resulted in poor statistical similarity and utility. Feeding real-world datasets to LLMs could mitigate this issue, but doing so raises privacy concerns, since patients’ information would be uploaded to the LLM API. To address these challenges and unleash the potential of LLMs in health data science, we present a new generation pipeline called LLMSYN. This pipeline utilizes only high-level statistical information from datasets and publicly available medical knowledge. The results demonstrate that the EHRs generated by LLMSYN exhibit improved statistical similarity and utility in downstream tasks, achieving predictive performance comparable to training with real data, while presenting minimal privacy risks. Our findings suggest that LLMSYN offers a promising approach to enhancing the utility of LLMs in synthetic structured EHR generation.