Evaluating LLM Generated Detection Rules in Cybersecurity

Anna Bertiger, Bobby Filar, Aryan Luthra, Stefano Meschiari, Aiden Mitchell, Sam Scholten, Vivek Sharath
Proceedings of the 2025 Conference on Applied Machine Learning for Information Security, PMLR 299:222-238, 2025.

Abstract

LLMs are increasingly pervasive in the security environment, yet there are few measures of their effectiveness, limiting their trust and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for evaluating LLM-generated cybersecurity rules. The benchmark employs a holdout set-based methodology to measure the effectiveness of LLM-generated security rules in comparison to a human-generated corpus of rules. It provides three key metrics inspired by the way experts evaluate security rules, offering a realistic, multifaceted evaluation of the effectiveness of an LLM-based security rule generator. This methodology is illustrated using rules from Sublime Security’s detection team and those written by Sublime Security’s Automated Detection Engineer (ADÉ), with a thorough analysis of ADÉ’s skills presented in the results section.
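The framework and metric definitions live in the paper and its open-source release rather than on this page. As a rough, hypothetical sketch of the holdout-set idea described in the abstract, the Python below treats detection rules as simple predicates over labeled messages and compares an LLM-generated rule set against a human-written one on the same held-out pool. The names Message, detection_rate, and false_positive_rate are illustrative assumptions, not the paper's three metrics and not Sublime Security's rule engine.

# Hypothetical sketch of a holdout-set evaluation for generated detection rules.
# Rules are modeled as plain Python predicates; a real system would evaluate
# rules written in a detection rule language against full message objects.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Message:
    text: str
    malicious: bool  # ground-truth label in the holdout set

Rule = Callable[[Message], bool]  # True if the rule flags the message

def flagged(rules: List[Rule], msg: Message) -> bool:
    """A message counts as detected if any rule in the set matches it."""
    return any(rule(msg) for rule in rules)

def detection_rate(rules: List[Rule], holdout: List[Message]) -> float:
    """Share of malicious holdout messages caught by the rule set."""
    malicious = [m for m in holdout if m.malicious]
    return sum(flagged(rules, m) for m in malicious) / max(len(malicious), 1)

def false_positive_rate(rules: List[Rule], holdout: List[Message]) -> float:
    """Share of benign holdout messages incorrectly flagged."""
    benign = [m for m in holdout if not m.malicious]
    return sum(flagged(rules, m) for m in benign) / max(len(benign), 1)

# Toy usage: score a human-written and an LLM-generated rule set on the same holdout pool.
holdout = [
    Message("urgent: verify your payroll account", malicious=True),
    Message("lunch menu for this week", malicious=False),
    Message("your mailbox quota is full, click here", malicious=True),
]
human_rules: List[Rule] = [lambda m: "payroll" in m.text]
llm_rules: List[Rule] = [lambda m: "verify" in m.text, lambda m: "click here" in m.text]

for name, rules in [("human", human_rules), ("llm", llm_rules)]:
    print(name, detection_rate(rules, holdout), false_positive_rate(rules, holdout))

This only shows the shape of a holdout comparison between generated and human-authored rule sets; the paper's actual metrics are expert-inspired and computed against Sublime Security's rule corpus and infrastructure.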

Cite this Paper

BibTeX
@InProceedings{pmlr-v299-bertiger25a,
  title     = {Evaluating LLM Generated Detection Rules in Cybersecurity},
  author    = {Bertiger, Anna and Filar, Bobby and Luthra, Aryan and Meschiari, Stefano and Mitchell, Aiden and Scholten, Sam and Sharath, Vivek},
  booktitle = {Proceedings of the 2025 Conference on Applied Machine Learning for Information Security},
  pages     = {222--238},
  year      = {2025},
  editor    = {Raff, Edward and Rudd, Ethan M.},
  volume    = {299},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--24 Oct},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v299/main/assets/bertiger25a/bertiger25a.pdf},
  url       = {https://proceedings.mlr.press/v299/bertiger25a.html},
  abstract  = {LLMs are increasingly pervasive in the security environment, with limited measures of their effectiveness, which limits trust and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for evaluating LLM-generated cybersecurity rules. The benchmark employs a holdout set-based methodology to measure the effectiveness of LLM-generated security rules in comparison to a human-generated corpus of rules. It provides three key metrics inspired by the way experts evaluate security rules, offering a realistic, multifaceted evaluation of the effectiveness of an LLM-based security rule generator. This methodology is illustrated using rules from Sublime Security’s detection team and those written by Sublime Security’s Automated Detection Engineer (ADÉ), with a thorough analysis of ADÉ’s skills presented in the results section.}
}
Endnote
%0 Conference Paper
%T Evaluating LLM Generated Detection Rules in Cybersecurity
%A Anna Bertiger
%A Bobby Filar
%A Aryan Luthra
%A Stefano Meschiari
%A Aiden Mitchell
%A Sam Scholten
%A Vivek Sharath
%B Proceedings of the 2025 Conference on Applied Machine Learning for Information Security
%C Proceedings of Machine Learning Research
%D 2025
%E Edward Raff
%E Ethan M. Rudd
%F pmlr-v299-bertiger25a
%I PMLR
%P 222--238
%U https://proceedings.mlr.press/v299/bertiger25a.html
%V 299
APA
Bertiger, A., Filar, B., Luthra, A., Meschiari, S., Mitchell, A., Scholten, S. & Sharath, V. (2025). Evaluating LLM Generated Detection Rules in Cybersecurity. Proceedings of the 2025 Conference on Applied Machine Learning for Information Security, in Proceedings of Machine Learning Research 299:222-238. Available from https://proceedings.mlr.press/v299/bertiger25a.html.