SEMINAR: SEMantic InformatioN Augmented JailbReak Attack in LLM

Junjie Yang, Fenghua Weng, Yue Xu, Wenjie Wang
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:1086-1101, 2025.

Abstract

Large Language Models (LLMs) have been widely adopted in real-world applications, yet their safety remains a major concern, particularly regarding jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Among various attack strategies, optimization-based jailbreak attacks have emerged as a primary approach by designing specialized loss functions to optimize adversarial suffixes appended to the harmful question. However, existing methods often suffer from poor generalization and over-refusal issues due to overly fixed optimization targets, which significantly undermine the utility of jailbreak attempts by yielding generic denials (e.g., "Sorry, I can’t assist with that") rather than harmful completions. These issues fundamentally stem from the rigid exact-match constraint in their loss design. To address this, we propose SEMINAR, a novel semantic information-augmented optimization framework that promotes diverse and semantically aligned affirmative responses. Specifically, we leverage semantic-level supervision to guide the optimization toward intent-consistent outputs rather than rigid templates by introducing a non-exact-match loss based on semantic similarity. Furthermore, we mitigate the token shift problem: an LLM's generation depends heavily on the correctness of the first few tokens, yet the loss is averaged over the entire sequence, so the early tokens receive insufficient attention during optimization. We address this by introducing a cosine decay scheduling mechanism that up-weights the early tokens of the sequence in the optimization process. As a result, SEMINAR not only enhances the diversity of affirmative responses generated by LLMs but also significantly improves overall attack effectiveness. Extensive experiments demonstrate the superiority of SEMINAR over baseline methods, along with its strong transferability across different models.
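The cosine decay scheduling mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the general idea of re-weighting a per-token loss so that early positions dominate, under the assumption that the decay follows a half-cosine from 1 at the first token toward 0 at the last. The function names `cosine_decay_weights` and `weighted_sequence_loss` are hypothetical.

```python
import math

def cosine_decay_weights(seq_len: int) -> list[float]:
    """Half-cosine weights: 1.0 at position 0, decaying toward 0.0 at the end.

    Illustrative only; the actual schedule in the paper may differ.
    """
    denom = max(seq_len - 1, 1)
    return [0.5 * (1.0 + math.cos(math.pi * t / denom)) for t in range(seq_len)]

def weighted_sequence_loss(token_losses: list[float]) -> float:
    """Weighted average of per-token losses, emphasizing early tokens.

    Contrast with a plain mean, where a wrong early token contributes
    no more than a wrong late token.
    """
    w = cosine_decay_weights(len(token_losses))
    return sum(wi * li for wi, li in zip(w, token_losses)) / sum(w)
```

With such a schedule, a mismatch on the first target token raises the loss far more than the same mismatch on the last token, steering the suffix optimization toward getting the opening of the affirmative response right.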

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-yang25a,
  title     = {SEMINAR: SEMantic InformatioN Augmented JailbReak Attack in LLM},
  author    = {Yang, Junjie and Weng, Fenghua and Xu, Yue and Wang, Wenjie},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {1086--1101},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/yang25a/yang25a.pdf},
  url       = {https://proceedings.mlr.press/v304/yang25a.html},
  abstract  = {Large Language Models (LLMs) have been widely adopted in real-world applications, yet their safety remains a major concern, particularly regarding jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Among various attack strategies, optimization-based jailbreak attacks have emerged as a primary approach by designing specialized loss functions to optimize adversarial suffixes appended to the harmful question. However, existing methods often suffer from poor generalization and over-refusal issues due to overly fixed optimization targets, which significantly undermine the utility of jailbreak attempts by yielding generic denials (e.g., "Sorry, I can’t assist with that") rather than harmful completions. These issues fundamentally stem from the rigid exact-match constraint in their loss design. To address this, we propose SEMINAR, a novel semantic information-augmented optimization framework that promotes diverse and semantically aligned affirmative responses. Specifically, we leverage semantic-level supervision to guide the optimization toward intent-consistent outputs rather than rigid templates by introducing a non-exact-match loss based on semantic similarity. Furthermore, we mitigate the token shift problem: an LLM's generation depends heavily on the correctness of the first few tokens, yet the loss is averaged over the entire sequence, so the early tokens receive insufficient attention during optimization. We address this by introducing a cosine decay scheduling mechanism that up-weights the early tokens of the sequence in the optimization process. As a result, SEMINAR not only enhances the diversity of affirmative responses generated by LLMs but also significantly improves overall attack effectiveness. Extensive experiments demonstrate the superiority of SEMINAR over baseline methods, along with its strong transferability across different models.}
}
Endnote
%0 Conference Paper %T SEMINAR: SEMantic InformatioN Augmented JailbReak Attack in LLM %A Junjie Yang %A Fenghua Weng %A Yue Xu %A Wenjie Wang %B Proceedings of the 17th Asian Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Hung-yi Lee %E Tongliang Liu %F pmlr-v304-yang25a %I PMLR %P 1086--1101 %U https://proceedings.mlr.press/v304/yang25a.html %V 304 %X Large Language Models (LLMs) have been widely adopted in real-world applications, yet their safety remains a major concern, particularly regarding jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Among various attack strategies, optimization-based jailbreak attacks have emerged as a primary approach by designing specialized loss functions to optimize adversarial suffixes appended to the harmful question. However, existing methods often suffer from poor generalization and over-refusal issues due to overly fixed optimization targets, which significantly undermine the utility of jailbreak attempts by yielding generic denials (e.g., "Sorry, I can’t assist with that") rather than harmful completions. These issues fundamentally stem from the rigid exact-match constraint in their loss design. To address this, we propose SEMINAR, a novel semantic information-augmented optimization framework that promotes diverse and semantically aligned affirmative responses. Specifically, we leverage semantic-level supervision to guide the optimization toward intent-consistent outputs rather than rigid templates by introducing a non-exact-match loss based on semantic similarity. Furthermore, we mitigate the token shift problem: an LLM's generation depends heavily on the correctness of the first few tokens, yet the loss is averaged over the entire sequence, so the early tokens receive insufficient attention during optimization. We address this by introducing a cosine decay scheduling mechanism that up-weights the early tokens of the sequence in the optimization process. As a result, SEMINAR not only enhances the diversity of affirmative responses generated by LLMs but also significantly improves overall attack effectiveness. Extensive experiments demonstrate the superiority of SEMINAR over baseline methods, along with its strong transferability across different models.
APA
Yang, J., Weng, F., Xu, Y. & Wang, W. (2025). SEMINAR: SEMantic InformatioN Augmented JailbReak Attack in LLM. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:1086-1101. Available from https://proceedings.mlr.press/v304/yang25a.html.
