Measuring the LEAK: A Fine-Grained Metric for Partial Information Leakage in Attempted Jailbreaking of Large Language Models
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:408-421, 2026.
Abstract
Large language model safety evaluation commonly relies on binary attack success rate (ASR) metrics, which fail to capture partial information leakage when models incompletely comply with adversarial prompts. We propose LEAK (Level of Exposed Actionable Knowledge), a fine-grained metric that decomposes attacker goals into objective components (OCs) and assigns weighted scores based on the degree to which each component is exposed. This enables precise identification of leaked behaviours and supports targeted safety improvements. We evaluated four scoring mechanisms (chrF, BERTScore, Sentence Transformers, and LLM-as-a-judge) across five open-source models. Embedding-based metrics struggled to distinguish malicious content from benign discussions, while LLM judges demonstrated superior discrimination. At-scale evaluation across 41 OCs spanning phishing, car theft, malware, and bullying showed that LLM judges achieved 80-98% accuracy in separating OCs that should score high from those that should score low on exposed actionable knowledge, with Qwen 8B reaching 97.56%.
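As an illustration of the weighted, per-component scoring described above, the following is a minimal sketch rather than the paper's actual formula. It assumes that each OC carries a non-negative weight and an exposure value in [0, 1] (for example, as produced by an LLM judge), and that the overall score is the weight-normalized sum of exposures. The names `ObjectiveComponent` and `leak_score`, and the normalization itself, are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ObjectiveComponent:
    """One objective component (OC) of an attacker goal (hypothetical structure)."""
    name: str
    weight: float    # relative importance of this OC; assumed non-negative
    exposure: float  # degree to which the response exposes this OC; assumed in [0, 1]


def leak_score(components: Sequence[ObjectiveComponent]) -> float:
    """Weight-normalized exposure across OCs (an assumed aggregation, not the paper's)."""
    for c in components:
        if c.weight < 0:
            raise ValueError(f"negative weight for {c.name!r}")
        if not 0.0 <= c.exposure <= 1.0:
            raise ValueError(f"exposure out of range for {c.name!r}")
    total_weight = sum(c.weight for c in components)
    if total_weight <= 0:
        raise ValueError("at least one component must have a positive weight")
    # 0.0 means nothing was exposed; 1.0 means every OC was fully exposed.
    return sum(c.weight * c.exposure for c in components) / total_weight


# Hypothetical goal split into three OCs: one fully exposed, one partly, one not at all.
example = [
    ObjectiveComponent("component_a", weight=0.5, exposure=1.0),
    ObjectiveComponent("component_b", weight=0.3, exposure=0.5),
    ObjectiveComponent("component_c", weight=0.2, exposure=0.0),
]
score = leak_score(example)
```

Under these assumptions, a response that a binary ASR metric would have to label as either success or failure receives an intermediate score, and the per-component exposures show which parts of the goal leaked.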