Language Generation with Infinite Contamination

Anay Mehrotra, Grigoris Velegkas, Xifan Yu, Felix Zhou
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:5053-5112, 2026.

Abstract

A recent line of work studies language generation in the limit, a formal model of language learning where an algorithm observes an adversarially generated enumeration of strings from an unknown target language $K$ and must eventually generate new, unseen strings from $K$. In this model, Kleinberg and Mullainathan (2024) proved that generation is achievable in surprisingly general settings; whenever $K$ belongs to a known countable collection of languages. However, their generator, while quite general, suffers from “mode collapse:” it generates from an ever-smaller subset of the target. To address this, Kleinberg and Wei (2025a) introduced a stronger notion of dense generation, requiring the output to asymptotically cover a positive fraction of the target, and showed it remains achievable for all countable collections. Both of these works rely on the crucial assumption of \textit{perfect} data: the adversary can neither insert strings from outside the target language (i.e., noise) nor omit strings from it (i.e., omissions). In practice, training data for language models is notoriously noisy, raising the fundamental question: \begin{center} \emph{How much contamination (either omissions or insertions) can language generation tolerate?} \end{center} Recent works have made partial progress on this question by studying (non-dense) generation with either finite amounts of noise (but no omissions) (Raman and Raman, 2025) or omissions (but no noise) (Bai et al., 2026). We characterize the contamination tolerance of both types of generation by proving the following results: \begin{itemize} \item \textbf{Generation under Contamination:} Language generation in the limit is achievable for all countable collections if and only if the fraction of contaminated examples converges to zero. When this condition fails, we characterize the collections which remain generable. \item \textbf{Dense Generation under Contamination:} Dense generation is achievable for all countable collections if and only if the amount of contamination is finite. For an infinite amount of contamination, we provide several characterizations of when dense generation is possible, showing it is strictly less robust than standard generation. \end{itemize} As a byproduct, we also resolve an open question of (Raman and Raman, 2025) on generation with membership oracle access under finite contamination.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-mehrotra26a, title = {Language Generation with Infinite Contamination}, author = {Mehrotra, Anay and Velegkas, Grigoris and Yu, Xifan and Zhou, Felix}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {5053--5112}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/mehrotra26a/mehrotra26a.pdf}, url = {https://proceedings.mlr.press/v336/mehrotra26a.html}, abstract = {A recent line of work studies language generation in the limit, a formal model of language learning where an algorithm observes an adversarially generated enumeration of strings from an unknown target language $K$ and must eventually generate new, unseen strings from $K$. In this model, Kleinberg and Mullainathan (2024) proved that generation is achievable in surprisingly general settings; whenever $K$ belongs to a known countable collection of languages. However, their generator, while quite general, suffers from “mode collapse:” it generates from an ever-smaller subset of the target. To address this, Kleinberg and Wei (2025a) introduced a stronger notion of dense generation, requiring the output to asymptotically cover a positive fraction of the target, and showed it remains achievable for all countable collections. Both of these works rely on the crucial assumption of \textit{perfect} data: the adversary can neither insert strings from outside the target language (i.e., noise) nor omit strings from it (i.e., omissions). In practice, training data for language models is notoriously noisy, raising the fundamental question: \begin{center} \emph{How much contamination (either omissions or insertions) can language generation tolerate?} \end{center} Recent works have made partial progress on this question by studying (non-dense) generation with either finite amounts of noise (but no omissions) (Raman and Raman, 2025) or omissions (but no noise) (Bai et al., 2026). We characterize the contamination tolerance of both types of generation by proving the following results: \begin{itemize} \item \textbf{Generation under Contamination:} Language generation in the limit is achievable for all countable collections if and only if the fraction of contaminated examples converges to zero. When this condition fails, we characterize the collections which remain generable. \item \textbf{Dense Generation under Contamination:} Dense generation is achievable for all countable collections if and only if the amount of contamination is finite. For an infinite amount of contamination, we provide several characterizations of when dense generation is possible, showing it is strictly less robust than standard generation. \end{itemize} As a byproduct, we also resolve an open question of (Raman and Raman, 2025) on generation with membership oracle access under finite contamination.} }
Endnote
%0 Conference Paper %T Language Generation with Infinite Contamination %A Anay Mehrotra %A Grigoris Velegkas %A Xifan Yu %A Felix Zhou %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-mehrotra26a %I PMLR %P 5053--5112 %U https://proceedings.mlr.press/v336/mehrotra26a.html %V 336 %X A recent line of work studies language generation in the limit, a formal model of language learning where an algorithm observes an adversarially generated enumeration of strings from an unknown target language $K$ and must eventually generate new, unseen strings from $K$. In this model, Kleinberg and Mullainathan (2024) proved that generation is achievable in surprisingly general settings; whenever $K$ belongs to a known countable collection of languages. However, their generator, while quite general, suffers from “mode collapse:” it generates from an ever-smaller subset of the target. To address this, Kleinberg and Wei (2025a) introduced a stronger notion of dense generation, requiring the output to asymptotically cover a positive fraction of the target, and showed it remains achievable for all countable collections. Both of these works rely on the crucial assumption of \textit{perfect} data: the adversary can neither insert strings from outside the target language (i.e., noise) nor omit strings from it (i.e., omissions). In practice, training data for language models is notoriously noisy, raising the fundamental question: \begin{center} \emph{How much contamination (either omissions or insertions) can language generation tolerate?} \end{center} Recent works have made partial progress on this question by studying (non-dense) generation with either finite amounts of noise (but no omissions) (Raman and Raman, 2025) or omissions (but no noise) (Bai et al., 2026). We characterize the contamination tolerance of both types of generation by proving the following results: \begin{itemize} \item \textbf{Generation under Contamination:} Language generation in the limit is achievable for all countable collections if and only if the fraction of contaminated examples converges to zero. When this condition fails, we characterize the collections which remain generable. \item \textbf{Dense Generation under Contamination:} Dense generation is achievable for all countable collections if and only if the amount of contamination is finite. For an infinite amount of contamination, we provide several characterizations of when dense generation is possible, showing it is strictly less robust than standard generation. \end{itemize} As a byproduct, we also resolve an open question of (Raman and Raman, 2025) on generation with membership oracle access under finite contamination.
APA
Mehrotra, A., Velegkas, G., Yu, X. & Zhou, F.. (2026). Language Generation with Infinite Contamination. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:5053-5112 Available from https://proceedings.mlr.press/v336/mehrotra26a.html.

Related Material