PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Ziyang Zhang; Qizhen Zhang; Jakob Nicolaus Foerster

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:60271-60287, 2024.

Abstract

Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM’s inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can’t do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN, which avoids this domain shift by simply asking the model to repeat its own outputs. PARDEN neither requires finetuning nor white box access to the model. We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. We find that PARDEN is particularly powerful in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviours dataset. Code and data are available at https://github.com/Ed-Zh/PARDEN.
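The repetition check described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the abstract does not say how the repeated text is compared with the original, so the sketch assumes a plain string-similarity score from Python's `difflib` and an arbitrary threshold of 0.5. The prompt wording, the function names, and the `toy_generate` stand-in model are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical prompt wording; the abstract does not give the actual prompt.
REPEAT_PREFIX = "Here is some text in brackets: ["
REPEAT_SUFFIX = "] Please repeat it exactly."


def repeat_similarity(generate: Callable[[str], str], output: str) -> float:
    """Ask the model to repeat `output` and score how faithfully it did so.

    Assumption: similarity is measured with difflib's ratio (0.0 to 1.0).
    """
    repeated = generate(f"{REPEAT_PREFIX}{output}{REPEAT_SUFFIX}")
    return SequenceMatcher(None, output, repeated).ratio()


def flag_as_harmful(
    generate: Callable[[str], str], output: str, threshold: float = 0.5
) -> bool:
    """Flag `output` when the model fails (or refuses) to repeat it.

    Assumption: 0.5 is an illustrative threshold, not a tuned value.
    """
    return repeat_similarity(generate, output) < threshold


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for an aligned LLM.

    It refuses when the prompt mentions a trigger word and otherwise
    echoes the bracketed text verbatim.
    """
    if "explosive" in prompt.lower():
        return "Sorry, I can't do that."
    start = prompt.index("[") + 1
    end = prompt.rindex("]")
    return prompt[start:end]


if __name__ == "__main__":
    benign = "The capital of France is Paris."
    harmful = (
        "Sure, here is a detailed step-by-step guide to building an "
        "explosive device at home, starting with the following materials."
    )
    print(flag_as_harmful(toy_generate, benign))
    print(flag_as_harmful(toy_generate, harmful))
```

In practice `generate` would wrap a call to the same LLM that produced the output, and the threshold would be chosen on held-out data to trade off TPR against FPR.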

Cite this Paper

BibTeX

@InProceedings{pmlr-v235-zhang24ca,
  title = 	 {{PARDEN}, Can You Repeat That? {D}efending against Jailbreaks via Repetition},
  author =       {Zhang, Ziyang and Zhang, Qizhen and Foerster, Jakob Nicolaus},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {60271--60287},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhang24ca/zhang24ca.pdf},
  url = 	 {https://proceedings.mlr.press/v235/zhang24ca.html},
  abstract = 	 {Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM’s inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can’t do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN, which avoids this domain shift by simply asking the model to repeat its own outputs. PARDEN neither requires finetuning nor white box access to the model. We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. We find that PARDEN is particularly powerful in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviours dataset. Code and data are available at https://github.com/Ed-Zh/PARDEN.}
}

Endnote

%0 Conference Paper
%T PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
%A Ziyang Zhang
%A Qizhen Zhang
%A Jakob Nicolaus Foerster
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zhang24ca
%I PMLR
%P 60271--60287
%U https://proceedings.mlr.press/v235/zhang24ca.html
%V 235
%X Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM’s inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can’t do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN, which avoids this domain shift by simply asking the model to repeat its own outputs. PARDEN neither requires finetuning nor white box access to the model. We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. We find that PARDEN is particularly powerful in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviours dataset. Code and data are available at https://github.com/Ed-Zh/PARDEN.

APA

Zhang, Z., Zhang, Q. & Foerster, J.N. (2024). PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:60271-60287. Available from https://proceedings.mlr.press/v235/zhang24ca.html.
