Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs

Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:58637-58657, 2025.

Abstract

Synthetic data offers a promising path to training models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective, but impractical when computational resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manually crafted prompts and use private information ineffectively in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M-parameter conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP-finetuned on private data to capture fine-grained textual information, while the topic model extracts a DP histogram representing the distributional information. The DP generator then samples according to the DP histogram to synthesize the desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong-privacy regime. Systematic ablations validate the design of each framework component and highlight the scalability of our approach.
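
To make the pipeline concrete, below is a minimal illustrative sketch (not the authors' released code) of the two DP steps the abstract describes: privatizing the per-topic histogram of the private corpus and sampling synthetic examples in proportion to it. The function names (`dp_topic_histogram`, `generate_example`), the number of topics, and the noise scale are assumptions for illustration; in the actual framework, noise would be calibrated to the target (ε, δ) budget and generation would use the DP-finetuned 140M conditional model prompted with topic keywords.

```python
import numpy as np

def dp_topic_histogram(topic_ids, num_topics, noise_scale, rng):
    """Privatize per-topic counts: each private example falls into exactly one
    bin, so the histogram has sensitivity 1 and additive Gaussian noise applies."""
    counts = np.bincount(topic_ids, minlength=num_topics).astype(float)
    counts += rng.normal(scale=noise_scale, size=num_topics)  # Gaussian mechanism
    counts = np.clip(counts, 0.0, None)                       # remove negative mass
    return counts / counts.sum()                               # normalize to a distribution

def generate_example(topic_id):
    # Stand-in for the DP-finetuned conditional generator, which would be
    # prompted with the keyword description of `topic_id` (hypothetical here).
    return f"<synthetic document conditioned on topic {topic_id}>"

rng = np.random.default_rng(0)
# Hypothetical topic assignments for the private corpus, produced by the
# publicly pretrained clustering-based topic model.
private_topic_ids = rng.integers(0, 32, size=10_000)

hist = dp_topic_histogram(private_topic_ids, num_topics=32, noise_scale=5.0, rng=rng)
topics = rng.choice(32, size=100, p=hist)        # sample topics per the DP histogram
synthetic_data = [generate_example(t) for t in topics]
print(synthetic_data[:3])
```

Because each private example contributes to exactly one topic bin, the histogram has sensitivity 1, which is what allows a single noise-addition step to privatize the distributional information under standard DP accounting.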

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-tan25g,
  title     = {Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale {LLM}s},
  author    = {Tan, Bowen and Xu, Zheng and Xing, Eric and Hu, Zhiting and Wu, Shanshan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {58637--58657},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/tan25g/tan25g.pdf},
  url       = {https://proceedings.mlr.press/v267/tan25g.html}
}
Endnote
%0 Conference Paper
%T Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs
%A Bowen Tan
%A Zheng Xu
%A Eric Xing
%A Zhiting Hu
%A Shanshan Wu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-tan25g
%I PMLR
%P 58637--58657
%U https://proceedings.mlr.press/v267/tan25g.html
%V 267
APA
Tan, B., Xu, Z., Xing, E., Hu, Z. & Wu, S. (2025). Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:58637-58657. Available from https://proceedings.mlr.press/v267/tan25g.html.