FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Yingwei Ma, Jiaheng Zhang, Bryan Hooi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:38623-38663, 2025.

Abstract

This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, building on the autoregressive nature of LLMs, we show that they tend to understand text from left to right and struggle to comprehend it when a perturbation is added to the left side. Motivated by these insights, we disguise the harmful prompt by constructing a left-side perturbation derived solely from the prompt itself, and generalize this idea into 4 flipping modes. Second, we verify that LLMs can reliably perform the text-flipping task and develop 4 attack variants that guide them to understand and execute the harmful behavior accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack: on average, it achieves a ~78.97% attack success rate across the 8 LLMs and a ~98% bypass rate against 5 guard models.
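To make the flipping idea concrete, below is a minimal, illustrative Python sketch of three natural flipping transformations consistent with the abstract's description: reversing word order, reversing characters within each word, and reversing the entire character sequence. The function names and the benign example prompt are illustrative assumptions, not the authors' released implementation or exact taxonomy of modes.

# Illustrative sketch of prompt-flipping transformations in the spirit of
# FlipAttack (not the authors' code). The flipped text serves as a left-side
# perturbation derived only from the prompt itself.

def flip_word_order(text: str) -> str:
    """Reverse the order of words, keeping each word intact."""
    return " ".join(reversed(text.split()))

def flip_chars_in_word(text: str) -> str:
    """Reverse the characters inside each word, keeping word order."""
    return " ".join(word[::-1] for word in text.split())

def flip_chars_in_sentence(text: str) -> str:
    """Reverse the entire character sequence of the sentence."""
    return text[::-1]

if __name__ == "__main__":
    prompt = "explain how transformers process text"  # benign example
    for flip in (flip_word_order, flip_chars_in_word, flip_chars_in_sentence):
        print(f"{flip.__name__}: {flip(prompt)}")

In the attack itself, the flipped string is paired with guidance asking the model to recover and act on the original text; the sketch above shows only the flipping step.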

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-liu25z,
  title     = {{F}lip{A}ttack: Jailbreak {LLM}s via Flipping},
  author    = {Liu, Yue and He, Xiaoxin and Xiong, Miao and Fu, Jinlan and Deng, Shumin and Ma, Yingwei and Zhang, Jiaheng and Hooi, Bryan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {38623--38663},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/liu25z/liu25z.pdf},
  url       = {https://proceedings.mlr.press/v267/liu25z.html}
}
Endnote
%0 Conference Paper
%T FlipAttack: Jailbreak LLMs via Flipping
%A Yue Liu
%A Xiaoxin He
%A Miao Xiong
%A Jinlan Fu
%A Shumin Deng
%A Yingwei Ma
%A Jiaheng Zhang
%A Bryan Hooi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-liu25z
%I PMLR
%P 38623--38663
%U https://proceedings.mlr.press/v267/liu25z.html
%V 267
APA
Liu, Y., He, X., Xiong, M., Fu, J., Deng, S., Ma, Y., Zhang, J., & Hooi, B. (2025). FlipAttack: Jailbreak LLMs via Flipping. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:38623-38663. Available from https://proceedings.mlr.press/v267/liu25z.html.