Protecting Language Generation Models via Invisible Watermarking

Xuandong Zhao, Yu-Xiang Wang, Lei Li
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:42187-42199, 2023.

Abstract

Language generation models have been an increasingly powerful enabler to many applications. Many such models offer free or affordable API access which makes them potentially vulnerable to model extraction attacks through distillation. To protect intellectual property (IP) and make fair use of these models, various techniques such as lexical watermarking and synonym replacement have been proposed. However, these methods can be nullified by obvious countermeasures such as “synonym randomization”. To address this issue, we propose GINSW, a novel method to protect text generation models from being stolen through distillation. The key idea of our method is to inject secret signals into the probability vector of the decoding steps for each target token. We can then detect the secret message by probing a suspect model to tell if it is distilled from the protected one. Experimental results show that GINSW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs. Our method demonstrates an absolute improvement of 19 to 29 points on mean average precision (mAP) in detecting suspects compared to previous methods against watermark removal attacks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-zhao23i, title = {Protecting Language Generation Models via Invisible Watermarking}, author = {Zhao, Xuandong and Wang, Yu-Xiang and Li, Lei}, booktitle = {Proceedings of the 40th International Conference on Machine Learning}, pages = {42187--42199}, year = {2023}, editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan}, volume = {202}, series = {Proceedings of Machine Learning Research}, month = {23--29 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v202/zhao23i/zhao23i.pdf}, url = {https://proceedings.mlr.press/v202/zhao23i.html}, abstract = {Language generation models have been an increasingly powerful enabler to many applications. Many such models offer free or affordable API access which makes them potentially vulnerable to model extraction attacks through distillation. To protect intellectual property (IP) and make fair use of these models, various techniques such as lexical watermarking and synonym replacement have been proposed. However, these methods can be nullified by obvious countermeasures such as “synonym randomization”. To address this issue, we propose GINSW, a novel method to protect text generation models from being stolen through distillation. The key idea of our method is to inject secret signals into the probability vector of the decoding steps for each target token. We can then detect the secret message by probing a suspect model to tell if it is distilled from the protected one. Experimental results show that GINSW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs. Our method demonstrates an absolute improvement of 19 to 29 points on mean average precision (mAP) in detecting suspects compared to previous methods against watermark removal attacks.} }
Endnote
%0 Conference Paper %T Protecting Language Generation Models via Invisible Watermarking %A Xuandong Zhao %A Yu-Xiang Wang %A Lei Li %B Proceedings of the 40th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2023 %E Andreas Krause %E Emma Brunskill %E Kyunghyun Cho %E Barbara Engelhardt %E Sivan Sabato %E Jonathan Scarlett %F pmlr-v202-zhao23i %I PMLR %P 42187--42199 %U https://proceedings.mlr.press/v202/zhao23i.html %V 202 %X Language generation models have been an increasingly powerful enabler to many applications. Many such models offer free or affordable API access which makes them potentially vulnerable to model extraction attacks through distillation. To protect intellectual property (IP) and make fair use of these models, various techniques such as lexical watermarking and synonym replacement have been proposed. However, these methods can be nullified by obvious countermeasures such as “synonym randomization”. To address this issue, we propose GINSW, a novel method to protect text generation models from being stolen through distillation. The key idea of our method is to inject secret signals into the probability vector of the decoding steps for each target token. We can then detect the secret message by probing a suspect model to tell if it is distilled from the protected one. Experimental results show that GINSW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs. Our method demonstrates an absolute improvement of 19 to 29 points on mean average precision (mAP) in detecting suspects compared to previous methods against watermark removal attacks.
APA
Zhao, X., Wang, Y. & Li, L.. (2023). Protecting Language Generation Models via Invisible Watermarking. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:42187-42199 Available from https://proceedings.mlr.press/v202/zhao23i.html.

Related Material