The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?

Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, Chu-Song Chen
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:830-845, 2025.

Abstract

Vision-Language Models (VLMs) have achieved remarkable performance across various tasks. Unfortunately, due to their multimodal nature, a common jailbreak strategy transforms harmful instructions into visual formats like stylized typography or AI-generated images to bypass safety alignment. Despite numerous heuristic defenses, little research has investigated the underlying rationale behind the jailbreak. In this paper, we introduce an information-theoretic framework to explore the fundamental trade-off between attack effectiveness and stealthiness. Leveraging Fano’s inequality, we show that an attacker’s success probability intrinsically relates to the stealthiness of the generated prompts. We further propose an efficient algorithm to detect non-stealthy jailbreak attacks. Experimental results highlight the inherent tension between strong attacks and detectability, offering a formal lower bound on adversarial strategies and potential defense mechanisms.
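As background for the trade-off described above, the classical form of Fano's inequality can be sketched numerically. The function below illustrates the generic textbook bound — H(X|Y) ≤ h(P_e) + P_e·log₂(|X|−1), which with h(P_e) ≤ 1 bit gives P_e ≥ (H(X|Y) − 1)/log₂(|X|−1) — and is not the paper's actual formulation; the function name and the candidate-count parameter are illustrative assumptions:

```python
import math

def fano_error_lower_bound(cond_entropy_bits: float, num_candidates: int) -> float:
    """Lower-bound the error probability P_e of any estimator of X from Y.

    Fano's inequality: H(X|Y) <= h(P_e) + P_e * log2(|X| - 1).
    Bounding the binary entropy h(P_e) by 1 bit yields
    P_e >= (H(X|Y) - 1) / log2(|X| - 1).
    """
    # log2(|X| - 1) must be positive, so at least 3 candidates are required.
    assert num_candidates >= 3, "bound needs log2(|X| - 1) > 0"
    bound = (cond_entropy_bits - 1.0) / math.log2(num_candidates - 1)
    return max(0.0, bound)  # the bound is vacuous (0) when H(X|Y) <= 1 bit

# Intuition for the paper's claim: the stealthier a jailbreak prompt, the more
# residual uncertainty H(X|Y) a detector faces, and the larger the guaranteed
# floor on its error probability.
print(fano_error_lower_bound(3.0, 16))  # high residual uncertainty -> nontrivial error floor
print(fano_error_lower_bound(0.2, 16))  # low residual uncertainty -> bound is vacuous
```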

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-kao25a,
  title     = {The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?},
  author    = {Kao, Ching-Chia and Yu, Chia-Mu and Lu, Chun-Shien and Chen, Chu-Song},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {830--845},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/kao25a/kao25a.pdf},
  url       = {https://proceedings.mlr.press/v304/kao25a.html},
  abstract  = {Vision-Language Models (VLMs) have achieved remarkable performance across various tasks. Unfortunately, due to their multimodal nature, a common jailbreak strategy transforms harmful instructions into visual formats like stylized typography or AI-generated images to bypass safety alignment. Despite numerous heuristic defenses, little research has investigated the underlying rationale behind the jailbreak. In this paper, we introduce an information-theoretic framework to explore the fundamental trade-off between attack effectiveness and stealthiness. Leveraging Fano's inequality, we show that an attacker's success probability intrinsically relates to the stealthiness of the generated prompts. We further propose an efficient algorithm to detect non-stealthy jailbreak attacks. Experimental results highlight the inherent tension between strong attacks and detectability, offering a formal lower bound on adversarial strategies and potential defense mechanisms.}
}
Endnote
%0 Conference Paper
%T The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?
%A Ching-Chia Kao
%A Chia-Mu Yu
%A Chun-Shien Lu
%A Chu-Song Chen
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-kao25a
%I PMLR
%P 830--845
%U https://proceedings.mlr.press/v304/kao25a.html
%V 304
%X Vision-Language Models (VLMs) have achieved remarkable performance across various tasks. Unfortunately, due to their multimodal nature, a common jailbreak strategy transforms harmful instructions into visual formats like stylized typography or AI-generated images to bypass safety alignment. Despite numerous heuristic defenses, little research has investigated the underlying rationale behind the jailbreak. In this paper, we introduce an information-theoretic framework to explore the fundamental trade-off between attack effectiveness and stealthiness. Leveraging Fano's inequality, we show that an attacker's success probability intrinsically relates to the stealthiness of the generated prompts. We further propose an efficient algorithm to detect non-stealthy jailbreak attacks. Experimental results highlight the inherent tension between strong attacks and detectability, offering a formal lower bound on adversarial strategies and potential defense mechanisms.
APA
Kao, C.-C., Yu, C.-M., Lu, C.-S. & Chen, C.-S. (2025). The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models? Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:830-845. Available from https://proceedings.mlr.press/v304/kao25a.html.