Discovering Spoofing Attempts on Language Model Watermarks

Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:19562-19584, 2025.

Abstract

LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.
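The abstract stops short of spelling out the test itself, so the snippet below is only a hypothetical sketch of the general idea: it assumes a green/red-list style watermark in which each generated token either hits (1) or misses (0) a context-dependent green list, and compares a suspect text's hit pattern against texts known to be genuinely watermarked using a plain binomial test. All names, rates, and thresholds are illustrative assumptions, not the paper's actual construction.

# Hypothetical sketch of a spoofing-detection test (not the paper's method).
# Assumes per-token 0/1 "green-list hit" indicators are already extracted.
import numpy as np
from scipy.stats import binomtest

def spoofing_test(suspect_hits, reference_hits, alpha=0.01):
    """Flag a suspect text whose watermark evidence deviates from genuine texts.

    suspect_hits:   0/1 array of green-list hits for the text under scrutiny.
    reference_hits: 0/1 array pooled from texts known to be genuinely watermarked.
    A forged watermark that reproduces the signal imperfectly (e.g., only on
    contexts the spoofer managed to learn) tends to over- or under-shoot the
    genuine hit rate, which this two-sided test picks up.
    """
    p_ref = float(np.mean(reference_hits))      # expected hit rate if genuine
    k = int(np.sum(suspect_hits))               # observed hits in suspect text
    n = len(suspect_hits)
    result = binomtest(k, n, p_ref, alternative="two-sided")
    return result.pvalue, result.pvalue < alpha  # (p-value, suspected spoof?)

# Example usage with synthetic data:
rng = np.random.default_rng(0)
reference = rng.binomial(1, 0.60, size=5000)    # genuine texts: ~60% green rate
suspect = rng.binomial(1, 0.75, size=400)       # forged text: overshoots the signal
print(spoofing_test(suspect, reference))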

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gloaguen25a,
  title     = {Discovering Spoofing Attempts on Language Model Watermarks},
  author    = {Gloaguen, Thibaud and Jovanovi\'{c}, Nikola and Staab, Robin and Vechev, Martin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {19562--19584},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gloaguen25a/gloaguen25a.pdf},
  url       = {https://proceedings.mlr.press/v267/gloaguen25a.html},
  abstract  = {LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.}
}
Endnote
%0 Conference Paper
%T Discovering Spoofing Attempts on Language Model Watermarks
%A Thibaud Gloaguen
%A Nikola Jovanović
%A Robin Staab
%A Martin Vechev
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gloaguen25a
%I PMLR
%P 19562--19584
%U https://proceedings.mlr.press/v267/gloaguen25a.html
%V 267
%X LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.
APA
Gloaguen, T., Jovanović, N., Staab, R. & Vechev, M. (2025). Discovering Spoofing Attempts on Language Model Watermarks. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:19562-19584. Available from https://proceedings.mlr.press/v267/gloaguen25a.html.
