AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses

Nicholas Carlini, Edoardo Debenedetti, Javier Rando, Milad Nasr, Florian Tramèr
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:6778-6793, 2025.

Abstract

We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs’ success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4’s 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
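To make concrete what "autonomously exploiting a defense" involves, the sketch below shows the core of a standard untargeted L-infinity PGD attack, the kind of loop an agent must write and then adapt to each defense's quirks. This is an illustrative sketch only, not code from the paper or its repository: defended_model is a hypothetical stand-in for a defense wrapper that returns differentiable logits, and the 8/255 budget is just a common convention for image classifiers.

    # Minimal sketch (assumes PyTorch; not taken from the AutoAdvExBench repository).
    import torch
    import torch.nn.functional as F

    def pgd_linf(defended_model, x, y, eps=8 / 255, alpha=2 / 255, steps=20):
        """Untargeted L-infinity PGD: maximize the loss within an eps-ball around x."""
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(defended_model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()                    # gradient-ascent step
                x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project back into the eps-ball
                x_adv = x_adv.clamp(0.0, 1.0)                          # keep pixels in a valid range
        return x_adv.detach()

Much of the benchmark's difficulty lies in what this loop glosses over: real defense repositories often wrap the model in preprocessing, detection, or non-differentiable components, so an agent must first read and refactor unfamiliar research code before a generic attack like this even applies.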

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-carlini25a,
  title = {{A}uto{A}dv{E}x{B}ench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses},
  author = {Carlini, Nicholas and Debenedetti, Edoardo and Rando, Javier and Nasr, Milad and Tram\`{e}r, Florian},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {6778--6793},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/carlini25a/carlini25a.pdf},
  url = {https://proceedings.mlr.press/v267/carlini25a.html},
  abstract = {We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs’ success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4’s 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.}
}
Endnote
%0 Conference Paper
%T AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses
%A Nicholas Carlini
%A Edoardo Debenedetti
%A Javier Rando
%A Milad Nasr
%A Florian Tramèr
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-carlini25a
%I PMLR
%P 6778--6793
%U https://proceedings.mlr.press/v267/carlini25a.html
%V 267
%X We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs’ success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4’s 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
APA
Carlini, N., Debenedetti, E., Rando, J., Nasr, M. & Tramèr, F. (2025). AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:6778-6793. Available from https://proceedings.mlr.press/v267/carlini25a.html.
