Generation of Bilingual Synthetic Clinical Notes for Realistic Data Augmentation
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:834-879, 2026.
Abstract
Synthetic clinical notes offer a promising solution to data scarcity and privacy constraints in clinical natural language processing. However, existing generation approaches often prioritize semantic accuracy and fail to reproduce the linguistic and structural (i.e., surface-level) characteristics of real-world clinical documentation, which limits their utility for downstream clinical tasks. In this study, we propose a feedback-loop generation framework with expert-informed prompting to improve the fidelity of synthetic clinical notes across both semantic and surface-level dimensions. Using individual case safety reports from the FDA Adverse Event Reporting System (FAERS), we formulated synthetic note generation as a controlled text generation task conditioned on adverse drug reaction descriptions and clinical narratives. We compared the proposed approach with other generation strategies (in-context learning and multi-agent generation) and prompting methods (base and expert-informed) under a unified experimental setup. Generation quality was assessed using embedding-based semantic similarity, surface-level statistical and distributional metrics, and blinded human evaluation. The feedback-loop generation framework outperformed the other approaches on both semantic metrics (mean clinical BERTScore = 0.885) and surface-level distributional metrics (token-level Jensen-Shannon divergence = 0.344), producing synthetic clinical notes that more closely resembled real-world clinical notes. Expert-informed prompting further improved semantic fidelity and lexical diversity.
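As an illustration of the surface-level distributional metric named in the abstract, the following is a minimal sketch of a token-level Jensen-Shannon divergence between two sets of notes. It assumes lowercased whitespace tokenization and a base-2 logarithm, which bounds the value in [0, 1]; the tokenizer, smoothing, and log base used in the paper are not stated here, and the function names are hypothetical.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Relative token frequencies over a list of notes.

    Assumption: lowercased whitespace tokenization, chosen only for illustration.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from empty text")
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two token distributions.

    p and q map tokens to probabilities; tokens missing from one side
    are treated as having probability 0 there.
    """
    jsd = 0.0
    for tok in set(p) | set(q):
        pi = p.get(tok, 0.0)
        qi = q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution M = (P + Q) / 2
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


# Toy example sentences (invented for illustration, not data from the study).
real_notes = ["patient reports nausea after second dose"]
synthetic_notes = ["patient reports mild nausea after dose"]
score = js_divergence(
    token_distribution(real_notes), token_distribution(synthetic_notes)
)
```

Under these assumptions, a value of 0 means the two token distributions are identical and a value of 1 means they share no tokens, so lower values indicate synthetic notes whose vocabulary usage is closer to that of the real notes.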