Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:48470-48487, 2025.

Abstract

Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in the literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, and, against larger models, 91% for Llama 3.1 70B and 94% for GPT-4o on the JailbreakBench dataset.
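The headline results above are reported as ASR@10, the attack success rate when up to ten attack attempts are allowed per prompt. As a rough illustration (not the paper's code; the function name and data layout are hypothetical), the sketch below counts a prompt as a success if any of its first k attempts elicits a violating response:

```python
# Minimal sketch of an ASR@k computation, assuming per-prompt lists of
# boolean success flags (one flag per attack attempt, in attempt order).

def asr_at_k(attempt_results: list[list[bool]], k: int) -> float:
    """Fraction of prompts with at least one success in the first k attempts."""
    if not attempt_results:
        return 0.0
    successes = sum(any(attempts[:k]) for attempts in attempt_results)
    return successes / len(attempt_results)

# Example with 3 prompts and up to 10 attempts each (flags are illustrative).
results = [
    [False, False, True],   # succeeded on the 3rd attempt
    [False] * 10,           # never succeeded
    [True],                 # succeeded immediately
]
print(f"ASR@10 = {asr_at_k(results, k=10):.2f}")  # 2 of 3 prompts -> 0.67
```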

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-pavlova25a,
  title     = {Automated Red Teaming with {GOAT}: the Generative Offensive Agent Tester},
  author    = {Pavlova, Maya and Brinkman, Erik and Iyer, Krithika and Albiero, V\'{\i}tor and Bitton, Joanna and Nguyen, Hailey and Ferrer, Cristian Canton and Evtimov, Ivan and Grattafiori, Aaron},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {48470--48487},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/pavlova25a/pavlova25a.pdf},
  url       = {https://proceedings.mlr.press/v267/pavlova25a.html},
  abstract  = {Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in the literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, and, against larger models, 91% for Llama 3.1 70B and 94% for GPT-4o on the JailbreakBench dataset.}
}
Endnote
%0 Conference Paper
%T Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
%A Maya Pavlova
%A Erik Brinkman
%A Krithika Iyer
%A Vítor Albiero
%A Joanna Bitton
%A Hailey Nguyen
%A Cristian Canton Ferrer
%A Ivan Evtimov
%A Aaron Grattafiori
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-pavlova25a
%I PMLR
%P 48470--48487
%U https://proceedings.mlr.press/v267/pavlova25a.html
%V 267
%X Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in the literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, and, against larger models, 91% for Llama 3.1 70B and 94% for GPT-4o on the JailbreakBench dataset.
APA
Pavlova, M., Brinkman, E., Iyer, K., Albiero, V., Bitton, J., Nguyen, H., Ferrer, C. C., Evtimov, I., & Grattafiori, A. (2025). Automated Red Teaming with GOAT: the Generative Offensive Agent Tester. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:48470-48487. Available from https://proceedings.mlr.press/v267/pavlova25a.html.