Adversarial Reasoning at Jailbreaking Time

Mahdi Sabbaghi, Paul Kassianik, George J. Pappas, Amin Karbasi, Hamed Hassani
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52526-52551, 2025.

Abstract

As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking that leverages a loss signal to guide the test-time compute, achieving SOTA attack success rates against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
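
The abstract describes a loss-guided use of test-time compute: an attacker spends its query budget refining candidate prompts in the directions a loss signal favors. The sketch below is only an illustration of that general idea, not the paper's method; `loss_fn` and `propose_fn` are hypothetical stand-ins for a scoring signal on the target model's response and an attacker model that proposes refinements, and the best-first search is an assumed, simplified strategy.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


def adversarial_search(
    goal: str,
    loss_fn: Callable[[str], float],                 # assumed: lower loss = target response closer to the goal
    propose_fn: Callable[[str, float], List[str]],   # hypothetical attacker model proposing refined prompts
    budget: int = 50,                                 # total target queries, i.e. the test-time compute budget
    beam_width: int = 4,
) -> Tuple[str, float]:
    """Loss-guided best-first search over candidate prompts (illustrative sketch only)."""
    counter = itertools.count()                       # tie-breaker so heapq never compares prompt strings
    frontier = [(loss_fn(goal), next(counter), goal)]
    best_prompt, best_loss = goal, frontier[0][0]
    queries = 1

    while frontier and queries < budget:
        loss, _, prompt = heapq.heappop(frontier)     # expand the currently most promising prompt
        for candidate in propose_fn(prompt, loss)[:beam_width]:
            if queries >= budget:
                break
            cand_loss = loss_fn(candidate)            # one target query per scored candidate
            queries += 1
            if cand_loss < best_loss:
                best_prompt, best_loss = candidate, cand_loss
            heapq.heappush(frontier, (cand_loss, next(counter), candidate))

    return best_prompt, best_loss
```

In this toy framing, enlarging `budget` trades more inference-time compute for a lower final loss, which is the knob the abstract refers to when it speaks of guiding test-time compute with a loss signal.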

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-sabbaghi25a,
  title     = {Adversarial Reasoning at Jailbreaking Time},
  author    = {Sabbaghi, Mahdi and Kassianik, Paul and Pappas, George J. and Karbasi, Amin and Hassani, Hamed},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {52526--52551},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/sabbaghi25a/sabbaghi25a.pdf},
  url       = {https://proceedings.mlr.press/v267/sabbaghi25a.html},
  abstract  = {As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking that leverages a loss signal to guide the test-time compute, achieving SOTA attack success rates against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.}
}
Endnote
%0 Conference Paper
%T Adversarial Reasoning at Jailbreaking Time
%A Mahdi Sabbaghi
%A Paul Kassianik
%A George J. Pappas
%A Amin Karbasi
%A Hamed Hassani
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-sabbaghi25a
%I PMLR
%P 52526--52551
%U https://proceedings.mlr.press/v267/sabbaghi25a.html
%V 267
%X As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking that leverages a loss signal to guide the test-time compute, achieving SOTA attack success rates against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
APA
Sabbaghi, M., Kassianik, P., Pappas, G.J., Karbasi, A. & Hassani, H. (2025). Adversarial Reasoning at Jailbreaking Time. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52526-52551. Available from https://proceedings.mlr.press/v267/sabbaghi25a.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v267/main/assets/sabbaghi25a/sabbaghi25a.pdf