Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:7306-7331, 2025.

Abstract

Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative—two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of $0.319$ in Attack Success Rate and $0.426$ in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chan25b,
  title     = {Speak Easy: Eliciting Harmful Jailbreaks from {LLM}s with Simple Interactions},
  author    = {Chan, Yik Siu and Ri, Narutatsu and Xiao, Yuxin and Ghassemi, Marzyeh},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {7306--7331},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chan25b/chan25b.pdf},
  url       = {https://proceedings.mlr.press/v267/chan25b.html}
}
Endnote
%0 Conference Paper
%T Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
%A Yik Siu Chan
%A Narutatsu Ri
%A Yuxin Xiao
%A Marzyeh Ghassemi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chan25b
%I PMLR
%P 7306--7331
%U https://proceedings.mlr.press/v267/chan25b.html
%V 267
APA
Chan, Y.S., Ri, N., Xiao, Y. & Ghassemi, M. (2025). Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:7306-7331. Available from https://proceedings.mlr.press/v267/chan25b.html.