How Much Can We Forget about Data Contamination?

Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike Von Luxburg
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:4998-5016, 2025.

Abstract

The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting by scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even contamination repeated 144 times can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.
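
Two quantities in the abstract are easy to make concrete: how many "times Chinchilla" a given token budget corresponds to, and the cumulative weight-decay factor that decoupled (AdamW-style) weight decay applies to a weight over training, which is the bound the paper compares empirical forgetting against. The sketch below is not from the paper; the 20-tokens-per-parameter rule of thumb for Chinchilla-optimal training, the cosine learning-rate schedule, and the particular step count, peak learning rate, and weight decay value are illustrative assumptions.

import math

def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate Chinchilla-optimal token budget (~20 tokens per parameter, an assumed heuristic)."""
    return tokens_per_param * n_params

def times_chinchilla(n_tokens: float, n_params: float) -> float:
    """How many 'times Chinchilla' a token budget corresponds to for a given model size."""
    return n_tokens / chinchilla_tokens(n_params)

def cumulative_weight_decay(lrs, weight_decay: float) -> float:
    """Fraction of a weight's initial value surviving decoupled weight decay over a run,
    i.e. the product of (1 - lr_t * weight_decay), ignoring gradient updates."""
    factor = 1.0
    for lr in lrs:
        factor *= 1.0 - lr * weight_decay
    return factor

if __name__ == "__main__":
    # Example: 1.6B parameters, the largest from-scratch model size in the paper.
    n_params = 1.6e9
    print(f"Chinchilla-optimal tokens: {chinchilla_tokens(n_params):.2e}")          # ~3.2e10
    print(f"40B tokens = {times_chinchilla(40e9, n_params):.2f}x Chinchilla")        # 1.25x

    # Hypothetical cosine learning-rate schedule (no warm-up) with AdamW-style weight decay.
    steps, peak_lr, wd = 100_000, 3e-4, 0.1
    lrs = [0.5 * peak_lr * (1 + math.cos(math.pi * t / steps)) for t in range(steps)]
    # Roughly exp(-wd * sum(lrs)) ~ 0.22 with these illustrative values: cumulative weight
    # decay alone leaves a sizeable fraction; the paper reports that empirical forgetting
    # of contaminated examples proceeds faster than this factor would suggest.
    print(f"Cumulative weight-decay factor: {cumulative_weight_decay(lrs, wd):.3f}")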

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-bordt25a,
  title     = {How Much Can We Forget about Data Contamination?},
  author    = {Bordt, Sebastian and Srinivas, Suraj and Boreiko, Valentyn and Luxburg, Ulrike Von},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {4998--5016},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/bordt25a/bordt25a.pdf},
  url       = {https://proceedings.mlr.press/v267/bordt25a.html},
  abstract  = {The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.}
}
Endnote
%0 Conference Paper
%T How Much Can We Forget about Data Contamination?
%A Sebastian Bordt
%A Suraj Srinivas
%A Valentyn Boreiko
%A Ulrike Von Luxburg
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-bordt25a
%I PMLR
%P 4998--5016
%U https://proceedings.mlr.press/v267/bordt25a.html
%V 267
%X The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.
APA
Bordt, S., Srinivas, S., Boreiko, V. & Luxburg, U.V. (2025). How Much Can We Forget about Data Contamination? Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:4998-5016. Available from https://proceedings.mlr.press/v267/bordt25a.html.
