Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention

Masaki Arai, Toshiki Shibahara, Daiki Chiba, Mitsuaki Akiyama, Masato Uchida
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:351-366, 2025.

Abstract

Jailbreak attacks reveal a persistent gap between the intended alignment of language models and their actual behavior during inference. To address this, we investigate how such attacks succeed at the internal level of model computation, focusing on attention heads. Unlike previous studies that primarily analyzed why jailbreaks work, our approach aims to develop a defense mechanism. We identify attention heads that influence whether a model produces a harmful or safe response by comparing activation patterns between a harmful prompt that is rejected and its adversarial variant that elicits a harmful response. By interpolating the internal representations of these heads between the two scenarios, we suppress harmful outputs while maintaining appropriate responses to benign prompts. Experiments with representative jailbreak methods, including GCG and AutoDAN, show that our method significantly reduces attack success rates without degrading response quality. For instance, with Llama-2-7b-chat, the average success rate drops from 39.3% to 1.1%. These findings reveal how internal attention dynamics affect output generation and demonstrate that targeted manipulation of internal components can enhance safety without requiring external filters or additional training.
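The abstract describes the mechanism only at a high level: attention heads are identified by contrasting activations between a rejected harmful prompt and its jailbroken adversarial variant, and the defense interpolates those heads' representations toward the safe scenario. The sketch below (Python/PyTorch, not the authors' released code) illustrates one way such head-level interpolation could be wired up on a Llama-2-style Hugging Face checkpoint, using forward pre-hooks on each attention block's output projection. The (layer, head) pairs, the coefficient alpha, the placeholder prompts, and the choice to cache and blend only the last token's activation are all simplifying assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup; the checkpoint matches the model reported in the paper.
name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

# Hypothetical (layer, head) pairs standing in for the heads the paper
# identifies by comparing activation patterns across the two scenarios,
# plus a hypothetical interpolation coefficient.
selected = [(12, 5), (18, 3)]
alpha = 0.5

head_dim = model.config.hidden_size // model.config.num_attention_heads
safe_acts = {}   # (layer, head) -> activation recorded on the safe run
record = True    # phase toggle: True = record safe run, False = intervene

def pre_hook_factory(layer_idx):
    def pre_hook(module, args):
        # o_proj receives the concatenated per-head attention outputs with
        # shape (batch, seq, hidden); split into heads to edit single heads.
        x = args[0].clone()
        b, s, _ = x.shape
        heads = x.view(b, s, -1, head_dim)
        for l, h in selected:
            if l != layer_idx:
                continue
            if record:
                # Safe run: cache this head's activation at the last token.
                safe_acts[(l, h)] = heads[:, -1, h, :].detach()
            elif (l, h) in safe_acts:
                # Intervention: blend the current activation toward the
                # cached safe one (a simplification of the paper's step).
                heads[:, -1, h, :] = ((1 - alpha) * heads[:, -1, h, :]
                                      + alpha * safe_acts[(l, h)])
        return (x,)  # heads is a view of x, so x carries the edits
    return pre_hook

for l in {l for l, _ in selected}:
    model.model.layers[l].self_attn.o_proj.register_forward_pre_hook(
        pre_hook_factory(l))

# Phase 1: record head activations while the model rejects the plain
# harmful prompt (placeholder text).
with torch.no_grad():
    model(**tok("PLAIN_HARMFUL_PROMPT", return_tensors="pt"))

# Phase 2: generate on the adversarial variant with interpolation active.
record = False
out = model.generate(**tok("ADVERSARIAL_VARIANT_PROMPT",
                           return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

At deployment the defense would not have a paired safe run for each incoming prompt; the two-phase structure here just makes the record-then-blend logic explicit.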

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-arai25a,
  title     = {Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention},
  author    = {Arai, Masaki and Shibahara, Toshiki and Chiba, Daiki and Akiyama, Mitsuaki and Uchida, Masato},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {351--366},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/arai25a/arai25a.pdf},
  url       = {https://proceedings.mlr.press/v304/arai25a.html},
  abstract  = {Jailbreak attacks reveal a persistent gap between the intended alignment of language models and their actual behavior during inference. To address this, we investigate how such attacks succeed at the internal level of model computation, focusing on attention heads. Unlike previous studies that primarily analyzed why jailbreaks work, our approach aims to develop a defense mechanism. We identify attention heads that influence whether a model produces a harmful or safe response by comparing activation patterns between a harmful prompt that is rejected and its adversarial variant that elicits a harmful response. By interpolating the internal representations of these heads between the two scenarios, we suppress harmful outputs while maintaining appropriate responses to benign prompts. Experiments with representative jailbreak methods, including GCG and AutoDAN, show that our method significantly reduces attack success rates without degrading response quality. For instance, with Llama-2-7b-chat, the average success rate drops from 39.3% to 1.1%. These findings reveal how internal attention dynamics affect output generation and demonstrate that targeted manipulation of internal components can enhance safety without requiring external filters or additional training.}
}
Endnote
%0 Conference Paper
%T Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention
%A Masaki Arai
%A Toshiki Shibahara
%A Daiki Chiba
%A Mitsuaki Akiyama
%A Masato Uchida
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-arai25a
%I PMLR
%P 351--366
%U https://proceedings.mlr.press/v304/arai25a.html
%V 304
%X Jailbreak attacks reveal a persistent gap between the intended alignment of language models and their actual behavior during inference. To address this, we investigate how such attacks succeed at the internal level of model computation, focusing on attention heads. Unlike previous studies that primarily analyzed why jailbreaks work, our approach aims to develop a defense mechanism. We identify attention heads that influence whether a model produces a harmful or safe response by comparing activation patterns between a harmful prompt that is rejected and its adversarial variant that elicits a harmful response. By interpolating the internal representations of these heads between the two scenarios, we suppress harmful outputs while maintaining appropriate responses to benign prompts. Experiments with representative jailbreak methods, including GCG and AutoDAN, show that our method significantly reduces attack success rates without degrading response quality. For instance, with Llama-2-7b-chat, the average success rate drops from 39.3% to 1.1%. These findings reveal how internal attention dynamics affect output generation and demonstrate that targeted manipulation of internal components can enhance safety without requiring external filters or additional training.
APA
Arai, M., Shibahara, T., Chiba, D., Akiyama, M. & Uchida, M. (2025). Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:351-366. Available from https://proceedings.mlr.press/v304/arai25a.html.
