REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective

Simon Geisler; Tom Wollschläger; M. H. I. Abdalla; Vincent Cohen-Addad; Johannes Gasteiger; Stephan Günnemann

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:19082-19111, 2025.

Abstract

To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses. We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD). For example, our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
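
For orientation, the generic REINFORCE (score-function) identity that the abstract refers to can be written as below. This is only the textbook form, not the paper's specific objective; the symbols are assumptions introduced here for illustration: x is the prompt, y a sampled response, P_theta(Y | x) the response distribution, theta whatever quantity that distribution depends on, and R(y) a reward that scores a response and does not itself depend on theta.

```latex
% Textbook REINFORCE / score-function identity (illustrative notation, not taken from the paper).
\[
  \nabla_\theta \, \mathbb{E}_{y \sim P_\theta(Y \mid x)} \bigl[ R(y) \bigr]
  \;=\;
  \mathbb{E}_{y \sim P_\theta(Y \mid x)} \bigl[ R(y) \, \nabla_\theta \log P_\theta(y \mid x) \bigr]
\]
```

The right-hand side is an expectation over sampled responses, which is what makes an objective of this kind distributional rather than tied to a single fixed target string.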

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-geisler25a,
  title = {{REINFORCE} Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective},
  author = {Geisler, Simon and Wollschl\"{a}ger, Tom and Abdalla, M. H. I. and Cohen-Addad, Vincent and Gasteiger, Johannes and G\"{u}nnemann, Stephan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {19082--19111},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/geisler25a/geisler25a.pdf},
  url = {https://proceedings.mlr.press/v267/geisler25a.html},
  abstract = {To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses. We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD). For example, our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2\% to 50\% with circuit breaker defense.}
}
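
A minimal sketch of citing this entry from a LaTeX document, assuming the BibTeX record above has been saved to a file named refs.bib (a hypothetical filename) and that the standard plain bibliography style is acceptable:

```latex
% Minimal sketch: refs.bib is a hypothetical file holding the entry above.
\documentclass{article}
\begin{document}
REINFORCE-based attack objectives are studied in~\cite{pmlr-v267-geisler25a}.

\bibliographystyle{plain}  % standard BibTeX style; swap in the venue's style as needed
\bibliography{refs}        % loads refs.bib
\end{document}
```

Running latex, then bibtex, then latex twice resolves the citation key.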

Endnote

%0 Conference Paper
%T REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
%A Simon Geisler
%A Tom Wollschläger
%A M. H. I. Abdalla
%A Vincent Cohen-Addad
%A Johannes Gasteiger
%A Stephan Günnemann
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-geisler25a
%I PMLR
%P 19082--19111
%U https://proceedings.mlr.press/v267/geisler25a.html
%V 267
%X To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses. We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD). For example, our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.

APA

Geisler, S., Wollschläger, T., Abdalla, M. H. I., Cohen-Addad, V., Gasteiger, J., & Günnemann, S. (2025). REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:19082-19111. Available from https://proceedings.mlr.press/v267/geisler25a.html.
