Measuring the LEAK: A Fine-Grained Metric for Partial Information Leakage in Attempted Jailbreaking of Large Language Models

Dakota Staples, Paul Cook, Saqib Hakak
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:408-421, 2026.

Abstract

Large language model safety evaluation commonly relies on binary attack success rate (ASR) metrics, which fail to capture partial information leakage when models incompletely comply with adversarial prompts. We propose LEAK (Level of Exposed Actionable Knowledge), a fine-grained metric that decomposes attacker goals into objective components (OCs) and assigns weighted scores based on the degree to which each component is exposed. This enables precise identification of leaked behaviours and supports targeted safety improvements. We evaluated four scoring mechanisms (chrF, BERTScore, Sentence Transformers, and LLM-as-a-judge) across five open-source models. Embedding-based metrics struggled to distinguish malicious content from benign discussions, while LLM judges demonstrated superior discrimination. At-scale evaluation across 41 OCs spanning phishing, car theft, malware, and bullying showed that LLM judges achieved 80-98% accuracy in distinguishing between OCs that should score high and those that should score low with respect to the amount of exposed actionable knowledge, with Qwen 8B reaching 97.56%.
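The abstract does not give the exact LEAK formula, so the following is only a minimal sketch of the idea it describes: an attacker goal is split into objective components, each component carries a weight and an exposure degree, and the score is a weighted aggregate. The normalized weighted mean, the `ObjectiveComponent` class, the `leak_score` function, and the example weights and exposures are all illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ObjectiveComponent:
    """One piece of an attacker goal (hypothetical structure)."""
    name: str
    weight: float    # relative importance of this component (assumed, > 0)
    exposure: float  # degree exposed in the response, assumed to lie in [0, 1]


def leak_score(components: Sequence[ObjectiveComponent]) -> float:
    """Weighted mean of exposure degrees, in [0, 1].

    Assumption: weights are normalized by their sum. The paper may
    aggregate differently.
    """
    total_weight = sum(c.weight for c in components)
    if total_weight <= 0:
        raise ValueError("at least one component with positive weight is required")
    for c in components:
        if c.weight < 0 or not 0.0 <= c.exposure <= 1.0:
            raise ValueError(f"invalid weight or exposure for component {c.name!r}")
    return sum(c.weight * c.exposure for c in components) / total_weight


# Made-up example: a response that fully exposes one component,
# half exposes another, and withholds the third.
example = [
    ObjectiveComponent("component A", weight=3.0, exposure=1.0),
    ObjectiveComponent("component B", weight=2.0, exposure=0.5),
    ObjectiveComponent("component C", weight=1.0, exposure=0.0),
]
print(f"{leak_score(example):.3f}")  # 0.667
```

A binary ASR judgement would reduce this response to a single success or failure label, whereas the graded score also records which components were exposed. In this sketch the exposure values would come from whichever scoring mechanism is in use, such as an LLM judge.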

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-staples26a,
  title = {Measuring the LEAK: A Fine-Grained Metric for Partial Information Leakage in Attempted Jailbreaking of Large Language Models},
  author = {Staples, Dakota and Cook, Paul and Hakak, Saqib},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {408--421},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/staples26a/staples26a.pdf},
  url = {https://proceedings.mlr.press/v318/staples26a.html},
  abstract = {Large language model safety evaluation commonly relies on binary attack success rate (ASR) metrics, which fail to capture partial information leakage when models incompletely comply with adversarial prompts. We propose LEAK (Level of Exposed Actionable Knowledge), a fine-grained metric that decomposes attacker goals into objective components (OCs) and assigns weighted scores based on the degree to which each component is exposed. This enables precise identification of leaked behaviours and supports targeted safety improvements. We evaluated four scoring mechanisms (chrF, BERTScore, Sentence Transformers, and LLM-as-a-judge) across five open-source models. Embedding-based metrics struggled to distinguish malicious content from benign discussions, while LLM judges demonstrated superior discrimination. At-scale evaluation across 41 OCs spanning phishing, car theft, malware, and bullying showed that LLM judges achieved 80-98% accuracy in distinguishing between OCs that should score high and those that should score low with respect to the amount of exposed actionable knowledge, with Qwen 8B reaching 97.56%.}
}
Endnote
%0 Conference Paper
%T Measuring the LEAK: A Fine-Grained Metric for Partial Information Leakage in Attempted Jailbreaking of Large Language Models
%A Dakota Staples
%A Paul Cook
%A Saqib Hakak
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-staples26a
%I PMLR
%P 408--421
%U https://proceedings.mlr.press/v318/staples26a.html
%V 318
%X Large language model safety evaluation commonly relies on binary attack success rate (ASR) metrics, which fail to capture partial information leakage when models incompletely comply with adversarial prompts. We propose LEAK (Level of Exposed Actionable Knowledge), a fine-grained metric that decomposes attacker goals into objective components (OCs) and assigns weighted scores based on the degree to which each component is exposed. This enables precise identification of leaked behaviours and supports targeted safety improvements. We evaluated four scoring mechanisms (chrF, BERTScore, Sentence Transformers, and LLM-as-a-judge) across five open-source models. Embedding-based metrics struggled to distinguish malicious content from benign discussions, while LLM judges demonstrated superior discrimination. At-scale evaluation across 41 OCs spanning phishing, car theft, malware, and bullying showed that LLM judges achieved 80-98% accuracy in distinguishing between OCs that should score high and those that should score low with respect to the amount of exposed actionable knowledge, with Qwen 8B reaching 97.56%.
APA
Staples, D., Cook, P. & Hakak, S. (2026). Measuring the LEAK: A Fine-Grained Metric for Partial Information Leakage in Attempted Jailbreaking of Large Language Models. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:408-421. Available from https://proceedings.mlr.press/v318/staples26a.html.
