Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease

Benjamin D. Ballyk; Ankit Gupta; Sujay Konda; Kavitha Subramanian; Chris Landon; Ahmed Ammar Naseer; Georg Maierhofer; Sumanth Swaminathan; Vasudevan Venkateshwaran

Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:985-1006, 2026.

Abstract

Data privacy is a critical challenge in modern medical workflows as the adoption of electronic patient records has grown rapidly. Stringent data protection regulations limit access to clinical records for training and integrating machine learning models that have shown promise in improving diagnostic accuracy and personalized care outcomes. Synthetic data offers a promising alternative; however, current generative models either struggle with time-series data or lack formal privacy guarantees. In this paper, we enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards. Using real data from chronic kidney disease and ICU patients, we evaluate our method through statistical tests, a Train-on-Synthetic-Test-on-Real (TSTR) setup, and expert clinical review. Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets, while our private model (DPTimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy–utility frontier. Both models achieve performance comparable to real data in clinician evaluations, providing robust input data necessary for developing models for complex chronic conditions without compromising data privacy.

Cite this Paper

BibTeX

@InProceedings{pmlr-v297-ballyk26a,
  title = 	 {Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease},
  author =       {Ballyk, Benjamin D. and Gupta, Ankit and Konda, Sujay and Subramanian, Kavitha and Landon, Chris and Naseer, Ahmed Ammar and Maierhofer, Georg and Swaminathan, Sumanth and Venkateshwaran, Vasudevan},
  booktitle = 	 {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages = 	 {985--1006},
  year = 	 {2026},
  editor = 	 {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume = 	 {297},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--14 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v297/main/assets/ballyk26a/ballyk26a.pdf},
  url = 	 {https://proceedings.mlr.press/v297/ballyk26a.html},
  abstract = 	 {Data privacy is a critical challenge in modern medical workflows as the adoption of electronic patient records has grown rapidly. Stringent data protection regulations limit access to clinical records for training and integrating machine learning models that have shown promise in improving diagnostic accuracy and personalized care outcomes. Synthetic data offers a promising alternative; however, current generative models either struggle with time-series data or lack formal privacy guarantees. In this paper, we enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards. Using real data from chronic kidney disease and ICU patients, we evaluate our method through statistical tests, a Train-on-Synthetic-Test-on-Real ({TSTR}) setup, and expert clinical review. Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets, while our private model ({DPTimeGAN}) maintains a mean authenticity of 0.778 on the {CKD} dataset, outperforming existing state-of-the-art models on the privacy–utility frontier. Both models achieve performance comparable to real data in clinician evaluations, providing robust input data necessary for developing models for complex chronic conditions without compromising data privacy.}
}

Endnote

%0 Conference Paper
%T Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease
%A Benjamin D. Ballyk
%A Ankit Gupta
%A Sujay Konda
%A Kavitha Subramanian
%A Chris Landon
%A Ahmed Ammar Naseer
%A Georg Maierhofer
%A Sumanth Swaminathan
%A Vasudevan Venkateshwaran
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-ballyk26a
%I PMLR
%P 985--1006
%U https://proceedings.mlr.press/v297/ballyk26a.html
%V 297
%X Data privacy is a critical challenge in modern medical workflows as the adoption of electronic patient records has grown rapidly. Stringent data protection regulations limit access to clinical records for training and integrating machine learning models that have shown promise in improving diagnostic accuracy and personalized care outcomes. Synthetic data offers a promising alternative; however, current generative models either struggle with time-series data or lack formal privacy guarantees. In this paper, we enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards. Using real data from chronic kidney disease and ICU patients, we evaluate our method through statistical tests, a Train-on-Synthetic-Test-on-Real (TSTR) setup, and expert clinical review. Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets, while our private model (DPTimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy–utility frontier. Both models achieve performance comparable to real data in clinician evaluations, providing robust input data necessary for developing models for complex chronic conditions without compromising data privacy.

APA

Ballyk, B.D., Gupta, A., Konda, S., Subramanian, K., Landon, C., Naseer, A.A., Maierhofer, G., Swaminathan, S. & Venkateshwaran, V. (2026). Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:985-1006. Available from https://proceedings.mlr.press/v297/ballyk26a.html.
