EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart, Cao Xiao, Jimeng Sun
Proceedings of the 6th Machine Learning for Healthcare Conference, PMLR 149:260-282, 2021.

Abstract

Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems deeply value patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and their volume is limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate that the generated EHR sequences are realistic. We confirmed that the performance of predictive models trained on the synthetic data is similar to that of models trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance, improving the best baseline by as much as 8% in top-20 recall.

Cite this Paper


BibTeX
@InProceedings{pmlr-v149-biswal21a,
  title = {EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders},
  author = {Biswal, Siddharth and Ghosh, Soumya and Duke, Jon and Malin, Bradley and Stewart, Walter and Xiao, Cao and Sun, Jimeng},
  booktitle = {Proceedings of the 6th Machine Learning for Healthcare Conference},
  pages = {260--282},
  year = {2021},
  editor = {Jung, Ken and Yeung, Serena and Sendak, Mark and Sjoding, Michael and Ranganath, Rajesh},
  volume = {149},
  series = {Proceedings of Machine Learning Research},
  month = {06--07 Aug},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v149/biswal21a/biswal21a.pdf},
  url = {https://proceedings.mlr.press/v149/biswal21a.html},
  abstract = {Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems deeply value patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and their volume is limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate that the generated EHR sequences are realistic. We confirmed that the performance of predictive models trained on the synthetic data is similar to that of models trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance, improving the best baseline by as much as 8% in top-20 recall.}
}
Endnote
%0 Conference Paper
%T EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders
%A Siddharth Biswal
%A Soumya Ghosh
%A Jon Duke
%A Bradley Malin
%A Walter Stewart
%A Cao Xiao
%A Jimeng Sun
%B Proceedings of the 6th Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2021
%E Ken Jung
%E Serena Yeung
%E Mark Sendak
%E Michael Sjoding
%E Rajesh Ranganath
%F pmlr-v149-biswal21a
%I PMLR
%P 260--282
%U https://proceedings.mlr.press/v149/biswal21a.html
%V 149
%X Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems deeply value patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and their volume is limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate that the generated EHR sequences are realistic. We confirmed that the performance of predictive models trained on the synthetic data is similar to that of models trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance, improving the best baseline by as much as 8% in top-20 recall.
APA
Biswal, S., Ghosh, S., Duke, J., Malin, B., Stewart, W., Xiao, C. & Sun, J. (2021). EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders. Proceedings of the 6th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 149:260-282. Available from https://proceedings.mlr.press/v149/biswal21a.html.
