Revisiting Character-level Adversarial Attacks for Language Models

Elias Abad Rocamora, Yongtao Wu, Fanghui Liu, Grigorios Chrysos, Volkan Cevher
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1-30, 2024.

Abstract

Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, which have gained prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention, as they cannot easily adopt popular gradient-based methods and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack that achieves a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by $4.84$ percentage points and the USE similarity by $8$ percentage points over the previous state of the art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.
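For intuition, the sketch below shows a greedy, query-based character-level attack of the kind the abstract describes. It is a minimal illustration under simplifying assumptions, not the Charmer algorithm from the paper: the names (Classifier, candidate_edits, greedy_char_attack) are hypothetical, and exhaustively querying the model on every distance-1 edit is far less query-efficient than the proposed method.

    # Minimal sketch of a greedy, query-based character-level attack.
    # NOT the Charmer algorithm; an illustrative baseline only.
    import string
    from typing import Callable, List

    # A victim model: maps a sentence to a list of class probabilities.
    Classifier = Callable[[str], List[float]]

    def candidate_edits(sentence: str, alphabet: str = string.ascii_lowercase) -> List[str]:
        """All sentences at character edit distance 1:
        insertions, substitutions and deletions."""
        out = []
        for i in range(len(sentence) + 1):
            for c in alphabet:
                out.append(sentence[:i] + c + sentence[i:])          # insert c at i
                if i < len(sentence):
                    out.append(sentence[:i] + c + sentence[i + 1:])  # substitute position i
            if i < len(sentence):
                out.append(sentence[:i] + sentence[i + 1:])          # delete position i
        return out

    def greedy_char_attack(model: Classifier, sentence: str, label: int,
                           max_edits: int = 10) -> str:
        """Greedily apply the single-character edit that most lowers the
        true-label probability until the prediction flips or the budget ends."""
        current = sentence
        for _ in range(max_edits):
            probs = model(current)
            if max(range(len(probs)), key=probs.__getitem__) != label:
                break  # success: the model no longer predicts the true label
            # Query the model on every distance-1 neighbour; keep the best one.
            current = min(candidate_edits(current), key=lambda s: model(s)[label])
        return current

Because each step changes at most one character, the perturbed sentence stays visually and semantically close to the original, which is the property that makes valid character-level adversarial examples easy to preserve.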

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-abad-rocamora24a,
  title     = {Revisiting Character-level Adversarial Attacks for Language Models},
  author    = {Abad Rocamora, Elias and Wu, Yongtao and Liu, Fanghui and Chrysos, Grigorios and Cevher, Volkan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {1--30},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/abad-rocamora24a/abad-rocamora24a.pdf},
  url       = {https://proceedings.mlr.press/v235/abad-rocamora24a.html}
}
Endnote
%0 Conference Paper
%T Revisiting Character-level Adversarial Attacks for Language Models
%A Elias Abad Rocamora
%A Yongtao Wu
%A Fanghui Liu
%A Grigorios Chrysos
%A Volkan Cevher
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-abad-rocamora24a
%I PMLR
%P 1--30
%U https://proceedings.mlr.press/v235/abad-rocamora24a.html
%V 235
APA
Abad Rocamora, E., Wu, Y., Liu, F., Chrysos, G. & Cevher, V. (2024). Revisiting Character-level Adversarial Attacks for Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:1-30. Available from https://proceedings.mlr.press/v235/abad-rocamora24a.html.
