Gandalf the Red: Adaptive Security for LLMs

Niklas Pfister, Václav Volhejn, Manuel Knott, Santiago Arias, Julia Bazinska, Mykhailo Bichurin, Alan Y. Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Damian Pascual-Ortiz, Jakub Podolak, Adrià Romero-López, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:49150-49194, 2025.

Abstract

Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.
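The "optimizable form" of the security-utility trade-off mentioned above can be read as a scalarized objective over candidate defenses. As a minimal sketch of what such an objective might look like (an illustration based only on the abstract, not the paper's exact formulation; the symbols U, S, D, and lambda are assumptions):

\[
d^{*} \in \arg\max_{d \in \mathcal{D}} \; U(d) + \lambda \, S(d)
\]

where \(U(d)\) denotes the utility a defense \(d\) leaves for legitimate users, \(S(d)\) its security against adaptive, multi-step attackers, and \(\lambda \ge 0\) sets the trade-off between the two.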

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-pfister25a,
  title = {Gandalf the Red: Adaptive Security for {LLM}s},
  author = {Pfister, Niklas and Volhejn, V\'{a}clav and Knott, Manuel and Arias, Santiago and Bazinska, Julia and Bichurin, Mykhailo and Commike, Alan Y. and Darling, Janet and Dienes, Peter and Fiedler, Matthew and Haber, David and Kraft, Matthias and Lancini, Marco and Mathys, Max and Pascual-Ortiz, Damian and Podolak, Jakub and Romero-L\'{o}pez, Adri\`{a} and Shiarlis, Kyriacos and Signer, Andreas and Terek, Zsolt and Theocharis, Athanasios and Timbrell, Daniel and Trautwein, Samuel and Watts, Samuel and Wu, Yun-Han and Rojas-Carulla, Mateo},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {49150--49194},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/pfister25a/pfister25a.pdf},
  url = {https://proceedings.mlr.press/v267/pfister25a.html},
  abstract = {Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.}
}
Endnote
%0 Conference Paper
%T Gandalf the Red: Adaptive Security for LLMs
%A Niklas Pfister
%A Václav Volhejn
%A Manuel Knott
%A Santiago Arias
%A Julia Bazinska
%A Mykhailo Bichurin
%A Alan Y. Commike
%A Janet Darling
%A Peter Dienes
%A Matthew Fiedler
%A David Haber
%A Matthias Kraft
%A Marco Lancini
%A Max Mathys
%A Damian Pascual-Ortiz
%A Jakub Podolak
%A Adrià Romero-López
%A Kyriacos Shiarlis
%A Andreas Signer
%A Zsolt Terek
%A Athanasios Theocharis
%A Daniel Timbrell
%A Samuel Trautwein
%A Samuel Watts
%A Yun-Han Wu
%A Mateo Rojas-Carulla
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-pfister25a
%I PMLR
%P 49150--49194
%U https://proceedings.mlr.press/v267/pfister25a.html
%V 267
%X Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.
APA
Pfister, N., Volhejn, V., Knott, M., Arias, S., Bazinska, J., Bichurin, M., Commike, A.Y., Darling, J., Dienes, P., Fiedler, M., Haber, D., Kraft, M., Lancini, M., Mathys, M., Pascual-Ortiz, D., Podolak, J., Romero-López, A., Shiarlis, K., Signer, A., Terek, Z., Theocharis, A., Timbrell, D., Trautwein, S., Watts, S., Wu, Y.-H. & Rojas-Carulla, M. (2025). Gandalf the Red: Adaptive Security for LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:49150-49194. Available from https://proceedings.mlr.press/v267/pfister25a.html.