AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses

Nicholas Carlini, Edoardo Debenedetti, Javier Rando, Milad Nasr, Florian Tramèr
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:6778-6793, 2025.

Abstract

We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs’ success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4’s 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
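To make concrete what "autonomously exploiting a defense" involves, the sketch below shows the core of a standard untargeted L-infinity PGD attack, the kind of loop an agent must write and then adapt to each defense's quirks. This is an illustrative sketch only, not code from the paper or its repository: defended_model is a hypothetical stand-in for a defense wrapper that returns differentiable logits, and the 8/255 budget is just a common convention for image classifiers.

    # Minimal sketch (assumes PyTorch; not taken from the AutoAdvExBench repository).
    import torch
    import torch.nn.functional as F

    def pgd_linf(defended_model, x, y, eps=8 / 255, alpha=2 / 255, steps=20):
        """Untargeted L-infinity PGD: maximize the loss within an eps-ball around x."""
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(defended_model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()                    # gradient-ascent step
                x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project back into the eps-ball
                x_adv = x_adv.clamp(0.0, 1.0)                          # keep pixels in a valid range
        return x_adv.detach()

Much of the benchmark's difficulty lies in what this loop glosses over: real defense repositories often wrap the model in preprocessing, detection, or non-differentiable components, so an agent must first read and refactor unfamiliar research code before a generic attack like this even applies.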

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-carlini25a,
  title = {{A}uto{A}dv{E}x{B}ench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses},
  author = {Carlini, Nicholas and Debenedetti, Edoardo and Rando, Javier and Nasr, Milad and Tram\`{e}r, Florian},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {6778--6793},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/carlini25a/carlini25a.pdf},
  url = {https://proceedings.mlr.press/v267/carlini25a.html},
  abstract = {We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs’ success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4’s 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.}
}
Endnote
%0 Conference Paper
%T AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses
%A Nicholas Carlini
%A Edoardo Debenedetti
%A Javier Rando
%A Milad Nasr
%A Florian Tramèr
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-carlini25a
%I PMLR
%P 6778--6793
%U https://proceedings.mlr.press/v267/carlini25a.html
%V 267
%X We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs’ success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4’s 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
APA
Carlini, N., Debenedetti, E., Rando, J., Nasr, M. & Tramèr, F. (2025). AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:6778-6793. Available from https://proceedings.mlr.press/v267/carlini25a.html.
