Unlocking Post-hoc Dataset Inference with Synthetic Data

Bihe Zhao; Pratyush Maini; Franziska Boenisch; Adam Dziedzic

Unlocking Post-hoc Dataset Inference with Synthetic Data

Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:77608-77629, 2025.

Abstract

The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations. Our code is available at https://github.com/sprintml/PostHocDatasetInference.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-zhao25q,
  title = 	 {Unlocking Post-hoc Dataset Inference with Synthetic Data},
  author =       {Zhao, Bihe and Maini, Pratyush and Boenisch, Franziska and Dziedzic, Adam},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {77608--77629},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhao25q/zhao25q.pdf},
  url = 	 {https://proceedings.mlr.press/v267/zhao25q.html},
  abstract = 	 {The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations. Our code is available at https://github.com/sprintml/PostHocDatasetInference.}
}

Endnote

%0 Conference Paper
%T Unlocking Post-hoc Dataset Inference with Synthetic Data
%A Bihe Zhao
%A Pratyush Maini
%A Franziska Boenisch
%A Adam Dziedzic
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu	
%F pmlr-v267-zhao25q
%I PMLR
%P 77608--77629
%U https://proceedings.mlr.press/v267/zhao25q.html
%V 267
%X The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations. Our code is available at https://github.com/sprintml/PostHocDatasetInference.

APA

Zhao, B., Maini, P., Boenisch, F. & Dziedzic, A.. (2025). Unlocking Post-hoc Dataset Inference with Synthetic Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:77608-77629 Available from https://proceedings.mlr.press/v267/zhao25q.html.

Unlocking Post-hoc Dataset Inference with Synthetic Data

Abstract

Cite this Paper

Related Material