$δ$-STEAL: LLM Stealing Attack with Local Differential Privacy

Kieu Dang, Phung Lai, Hai Phan, yelong shen, Ruoming Jin, Abdallah Khreishah
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:622-637, 2025.

Abstract

Large language models (LLMs) demonstrate remarkable capabilities across various tasks. However, their deployment introduces significant risks related to intellectual property. In this context, we focus on model stealing attacks, where adversaries replicate the behaviors of these models to steal services. These attacks are highly relevant to proprietary LLMs and pose serious threats to revenue and financial stability. To mitigate these risks, the watermarking solution embeds imperceptible patterns in LLM outputs, enabling model traceability and intellectual property verification. In this paper, we study the vulnerability of LLM service providers by introducing $\delta$-Steal, a novel model stealing attack that bypasses the service provider’s watermark detectors while preserving the adversary’s model utility. $\delta$-Steal injects noise into the token embeddings of the adversary’s model during fine-tuning in a way that satisfies local differential privacy (LDP) guarantees. The adversary queries the service provider’s model to collect outputs and form input-output training pairs. By applying LDP-preserving noise to these pairs, $\delta$-Steal obfuscates watermark signals, making it difficult for the service provider to determine whether its outputs were used, thereby preventing claims of model theft. Our experiments show that $\delta$-Steal with lightweight modifications achieves attack success rates of up to 96.95% without significantly compromising the adversary’s model utility. The noise scale in LDP controls the trade-off between attack effectiveness and model utility. This poses a significant risk, as even robust watermarks can be bypassed, allowing adversaries to deceive watermark detectors and undermine current intellectual property protection methods.
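The abstract describes injecting LDP-preserving noise into token embeddings. As a rough illustration only (not the paper's actual mechanism), a standard way to make a vector release satisfy LDP is to clip the embedding's L1 norm to bound sensitivity and add Laplace noise calibrated to a privacy budget epsilon; the function name, clipping scheme, and parameters below are assumptions for this sketch.

```python
import numpy as np

def ldp_perturb(embedding, epsilon, clip_norm=1.0, rng=None):
    """Hypothetical sketch: perturb an embedding vector with the Laplace
    mechanism so that releasing it satisfies epsilon-LDP.

    Clipping bounds the L1 distance between any two clipped inputs by
    2 * clip_norm, which is the sensitivity used to scale the noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    e = np.asarray(embedding, dtype=float)
    # Clip the L1 norm so the worst-case change between inputs is bounded.
    norm = np.linalg.norm(e, ord=1)
    if norm > clip_norm:
        e = e * (clip_norm / norm)
    # Laplace scale = sensitivity / epsilon; larger epsilon -> less noise.
    scale = 2.0 * clip_norm / epsilon
    return e + rng.laplace(loc=0.0, scale=scale, size=e.shape)
```

The `scale` term makes the trade-off the abstract mentions explicit: a small epsilon (strong LDP) adds heavy noise that obscures watermark signals but degrades the fine-tuned model's utility, while a large epsilon does the reverse.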

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-dang25a,
  title     = {$\delta$-STEAL: LLM Stealing Attack with Local Differential Privacy},
  author    = {Dang, Kieu and Lai, Phung and Phan, Hai and shen, yelong and Jin, Ruoming and Khreishah, Abdallah},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {622--637},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/dang25a/dang25a.pdf},
  url       = {https://proceedings.mlr.press/v304/dang25a.html},
  abstract  = {Large language models (LLMs) demonstrate remarkable capabilities across various tasks. However, their deployment introduces significant risks related to intellectual property. In this context, we focus on model stealing attacks, where adversaries replicate the behaviors of these models to steal services. These attacks are highly relevant to proprietary LLMs and pose serious threats to revenue and financial stability. To mitigate these risks, the watermarking solution embeds imperceptible patterns in LLM outputs, enabling model traceability and intellectual property verification. In this paper, we study the vulnerability of LLM service providers by introducing $\delta$-Steal, a novel model stealing attack that bypasses the service provider's watermark detectors while preserving the adversary's model utility. $\delta$-Steal injects noise into the token embeddings of the adversary's model during fine-tuning in a way that satisfies local differential privacy (LDP) guarantees. The adversary queries the service provider's model to collect outputs and form input-output training pairs. By applying LDP-preserving noise to these pairs, $\delta$-Steal obfuscates watermark signals, making it difficult for the service provider to determine whether its outputs were used, thereby preventing claims of model theft. Our experiments show that $\delta$-Steal with lightweight modifications achieves attack success rates of up to $96.95\%$ without significantly compromising the adversary's model utility. The noise scale in LDP controls the trade-off between attack effectiveness and model utility. This poses a significant risk, as even robust watermarks can be bypassed, allowing adversaries to deceive watermark detectors and undermine current intellectual property protection methods.}
}
Endnote
%0 Conference Paper %T $δ$-STEAL: LLM Stealing Attack with Local Differential Privacy %A Kieu Dang %A Phung Lai %A Hai Phan %A yelong shen %A Ruoming Jin %A Abdallah Khreishah %B Proceedings of the 17th Asian Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Hung-yi Lee %E Tongliang Liu %F pmlr-v304-dang25a %I PMLR %P 622--637 %U https://proceedings.mlr.press/v304/dang25a.html %V 304 %X Large language models (LLMs) demonstrate remarkable capabilities across various tasks. However, their deployment introduces significant risks related to intellectual property. In this context, we focus on model stealing attacks, where adversaries replicate the behaviors of these models to steal services. These attacks are highly relevant to proprietary LLMs and pose serious threats to revenue and financial stability. To mitigate these risks, the watermarking solution embeds imperceptible patterns in LLM outputs, enabling model traceability and intellectual property verification. In this paper, we study the vulnerability of LLM service providers by introducing $\delta$-Steal, a novel model stealing attack that bypasses the service provider’s watermark detectors while preserving the adversary’s model utility. $\delta$-Steal injects noise into the token embeddings of the adversary’s model during fine-tuning in a way that satisfies local differential privacy (LDP) guarantees. The adversary queries the service provider’s model to collect outputs and form input-output training pairs. By applying LDP-preserving noise to these pairs, $\delta$-Steal obfuscates watermark signals, making it difficult for the service provider to determine whether its outputs were used, thereby preventing claims of model theft. Our experiments show that $\delta$-Steal with lightweight modifications achieves attack success rates of up to 96.95% without significantly compromising the adversary’s model utility. The noise scale in LDP controls the trade-off between attack effectiveness and model utility. This poses a significant risk, as even robust watermarks can be bypassed, allowing adversaries to deceive watermark detectors and undermine current intellectual property protection methods.
APA
Dang, K., Lai, P., Phan, H., shen, y., Jin, R. & Khreishah, A. (2025). $δ$-STEAL: LLM Stealing Attack with Local Differential Privacy. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:622-637. Available from https://proceedings.mlr.press/v304/dang25a.html.

Related Material