Grey-box Extraction of Natural Language Models

Santiago Zanella-Beguelin, Shruti Tople, Andrew Paverd, Boris Köpf
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12278-12286, 2021.

Abstract

Model extraction attacks attempt to replicate a target machine learning model by querying its inference API. State-of-the-art attacks are learning-based and construct replicas by supervised training on the target model’s predictions, but an emerging class of attacks exploit algebraic properties to obtain high-fidelity replicas using orders of magnitude fewer queries. So far, these algebraic attacks have been limited to neural networks with few hidden layers and ReLU activations. In this paper we present algebraic and hybrid algebraic/learning-based attacks on large-scale natural language models. We consider a grey-box setting, targeting models with a pre-trained (public) encoder followed by a single (private) classification layer. Our key findings are that (i) with a frozen encoder, high-fidelity extraction is possible with a small number of in-distribution queries, making extraction attacks indistinguishable from legitimate use; (ii) when the encoder is fine-tuned, a hybrid learning-based/algebraic attack improves over the learning-based state-of-the-art without requiring additional queries.
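The frozen-encoder finding (i) can be illustrated with a minimal sketch (not the paper's actual attack): if the attacker shares the public encoder and the API returns raw logits from a single private linear layer, the layer's weights and bias can be recovered exactly by solving a linear system from roughly d+1 queries, where d is the embedding dimension. All names here (`victim_api`, the dimensions) are hypothetical for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a frozen public encoder maps inputs to d-dim embeddings;
# the victim's private head is a single linear layer (W, b) producing logits.
d, k = 16, 3                      # embedding dim, number of classes
W_true = rng.normal(size=(k, d))  # private weights (unknown to the attacker)
b_true = rng.normal(size=k)       # private bias (unknown to the attacker)

def victim_api(embeddings):
    """Simulated inference API returning raw logits for given embeddings."""
    return embeddings @ W_true.T + b_true

# Algebraic extraction: since the encoder is public and frozen, the attacker
# can compute the embeddings of its own d+1 queries, observe the logits, and
# solve the resulting linear system for the private layer.
n = d + 1
E = rng.normal(size=(n, d))       # embeddings of the attacker's queries
Y = victim_api(E)                 # observed logits, shape (n, k)

# Append a ones column so the bias is recovered jointly with the weights.
E_aug = np.hstack([E, np.ones((n, 1))])
theta, *_ = np.linalg.lstsq(E_aug, Y, rcond=None)
W_hat, b_hat = theta[:-1].T, theta[-1]

print(np.allclose(W_hat, W_true, atol=1e-6),
      np.allclose(b_hat, b_true, atol=1e-6))
```

Because the queries themselves are unconstrained, they can be drawn from the task's natural input distribution, which is why such an attack is hard to distinguish from legitimate API use.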

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-zanella-beguelin21a,
  title     = {Grey-box Extraction of Natural Language Models},
  author    = {Zanella-Beguelin, Santiago and Tople, Shruti and Paverd, Andrew and K{\"o}pf, Boris},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {12278--12286},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/zanella-beguelin21a/zanella-beguelin21a.pdf},
  url       = {https://proceedings.mlr.press/v139/zanella-beguelin21a.html},
  abstract  = {Model extraction attacks attempt to replicate a target machine learning model by querying its inference API. State-of-the-art attacks are learning-based and construct replicas by supervised training on the target model’s predictions, but an emerging class of attacks exploit algebraic properties to obtain high-fidelity replicas using orders of magnitude fewer queries. So far, these algebraic attacks have been limited to neural networks with few hidden layers and ReLU activations. In this paper we present algebraic and hybrid algebraic/learning-based attacks on large-scale natural language models. We consider a grey-box setting, targeting models with a pre-trained (public) encoder followed by a single (private) classification layer. Our key findings are that (i) with a frozen encoder, high-fidelity extraction is possible with a small number of in-distribution queries, making extraction attacks indistinguishable from legitimate use; (ii) when the encoder is fine-tuned, a hybrid learning-based/algebraic attack improves over the learning-based state-of-the-art without requiring additional queries.}
}
Endnote
%0 Conference Paper
%T Grey-box Extraction of Natural Language Models
%A Santiago Zanella-Beguelin
%A Shruti Tople
%A Andrew Paverd
%A Boris Köpf
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-zanella-beguelin21a
%I PMLR
%P 12278--12286
%U https://proceedings.mlr.press/v139/zanella-beguelin21a.html
%V 139
%X Model extraction attacks attempt to replicate a target machine learning model by querying its inference API. State-of-the-art attacks are learning-based and construct replicas by supervised training on the target model’s predictions, but an emerging class of attacks exploit algebraic properties to obtain high-fidelity replicas using orders of magnitude fewer queries. So far, these algebraic attacks have been limited to neural networks with few hidden layers and ReLU activations. In this paper we present algebraic and hybrid algebraic/learning-based attacks on large-scale natural language models. We consider a grey-box setting, targeting models with a pre-trained (public) encoder followed by a single (private) classification layer. Our key findings are that (i) with a frozen encoder, high-fidelity extraction is possible with a small number of in-distribution queries, making extraction attacks indistinguishable from legitimate use; (ii) when the encoder is fine-tuned, a hybrid learning-based/algebraic attack improves over the learning-based state-of-the-art without requiring additional queries.
APA
Zanella-Beguelin, S., Tople, S., Paverd, A. & Köpf, B. (2021). Grey-box Extraction of Natural Language Models. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12278-12286. Available from https://proceedings.mlr.press/v139/zanella-beguelin21a.html.

Related Material