Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:9982-10009, 2025.

Abstract

Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)’s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which exploits this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. Experimental results show that SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only $0.88 per million tokens. Our approach requires no access to the watermark algorithm or the watermarked LLM and transfers seamlessly to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
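As a rough illustration of the quantity the attack is built on, the sketch below computes per-token self-information, I(t) = -log2 p(t), and flags the high-information tokens where watermark signals tend to live. This is a toy unigram model estimated from the sequence itself; the paper's actual method scores tokens with an LLM's conditional probabilities, so the model choice, the function names, and the threshold here are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter

def self_information(tokens):
    """Per-token self-information I(t) = -log2 p(t), with p estimated
    from a unigram model over the token sequence itself."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: -math.log2(counts[t] / total) for t in counts}

def flag_high_info_tokens(tokens, threshold):
    """Return tokens whose self-information exceeds the threshold:
    candidate 'pattern tokens' for targeted rewriting, since current
    watermarks prefer to embed signals at high-entropy positions."""
    info = self_information(tokens)
    return [t for t in tokens if info[t] > threshold]
```

For example, in the sequence `["a", "a", "a", "b"]` the rare token `"b"` carries -log2(1/4) = 2 bits of self-information and is flagged at a threshold of 1.0, while the common token `"a"` (about 0.42 bits) is not.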

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-cheng25c,
  title = {Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks},
  author = {Cheng, Yixin and Guo, Hongcheng and Li, Yangming and Sigal, Leonid},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {9982--10009},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/cheng25c/cheng25c.pdf},
  url = {https://proceedings.mlr.press/v267/cheng25c.html},
  abstract = {Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)’s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which exploits this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. Experimental results show that SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only $0.88 per million tokens. Our approach requires no access to the watermark algorithm or the watermarked LLM and transfers seamlessly to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.}
}
Endnote
%0 Conference Paper
%T Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks
%A Yixin Cheng
%A Hongcheng Guo
%A Yangming Li
%A Leonid Sigal
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-cheng25c
%I PMLR
%P 9982--10009
%U https://proceedings.mlr.press/v267/cheng25c.html
%V 267
%X Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)’s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which exploits this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. Experimental results show that SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only $0.88 per million tokens. Our approach requires no access to the watermark algorithm or the watermarked LLM and transfers seamlessly to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
APA
Cheng, Y., Guo, H., Li, Y. & Sigal, L. (2025). Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:9982-10009. Available from https://proceedings.mlr.press/v267/cheng25c.html.

Related Material