Real-Time Jailbreak Detection via Safety-Weighted Semantic Entropy Probes

Ata Dundar Yigit; Mohammad Zandsalimy; Shanu Sushmita

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:1092-1099, 2026.

Abstract

Large language models remain vulnerable to jailbreak attacks that bypass safety alignment. Existing defenses often require multi-pass generation or gradient analysis, limiting real-time deployment. We introduce Safety-Weighted Semantic Entropy (SWSE) Probes, a lightweight method for detecting jailbreak attempts at the token-before-generation stage using neural probes on model hidden states. Inspired by semantic entropy approaches for hallucination detection, our method estimates jailbreak likelihood from a single forward pass by training probes on safety-aware entropy scores derived from clustered model responses. Evaluated on Llama-3.2-3B-Instruct using 9,697 harmful and 7,000 benign prompts, our concatenated multi-layer MLP probes achieve ROC AUC of 0.989 and 96.7% accuracy with 100$\times$ less computation than multi-sampling defenses.

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-yigit26a,
  title = 	 {Real-Time Jailbreak Detection via Safety-Weighted Semantic Entropy Probes},
  author =       {Yigit, Ata Dundar and Zandsalimy, Mohammad and Sushmita, Shanu},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {1092--1099},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/yigit26a/yigit26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/yigit26a.html},
  abstract = 	 {Large language models remain vulnerable to jailbreak attacks that bypass safety alignment. Existing defenses often require multi-pass generation or gradient analysis, limiting real-time deployment. We introduce Safety-Weighted Semantic Entropy (SWSE) Probes, a lightweight method for detecting jailbreak attempts at the token-before-generation stage using neural probes on model hidden states. Inspired by semantic entropy approaches for hallucination detection, our method estimates jailbreak likelihood from a single forward pass by training probes on safety-aware entropy scores derived from clustered model responses. Evaluated on Llama-3.2-3B-Instruct using 9,697 harmful and 7,000 benign prompts, our concatenated multi-layer MLP probes achieve ROC AUC of 0.989 and 96.7% accuracy with 100$\times$ less computation than multi-sampling defenses.}
}

Endnote

%0 Conference Paper
%T Real-Time Jailbreak Detection via Safety-Weighted Semantic Entropy Probes
%A Ata Dundar Yigit
%A Mohammad Zandsalimy
%A Shanu Sushmita
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-yigit26a
%I PMLR
%P 1092--1099
%U https://proceedings.mlr.press/v318/yigit26a.html
%V 318
%X Large language models remain vulnerable to jailbreak attacks that bypass safety alignment. Existing defenses often require multi-pass generation or gradient analysis, limiting real-time deployment. We introduce Safety-Weighted Semantic Entropy (SWSE) Probes, a lightweight method for detecting jailbreak attempts at the token-before-generation stage using neural probes on model hidden states. Inspired by semantic entropy approaches for hallucination detection, our method estimates jailbreak likelihood from a single forward pass by training probes on safety-aware entropy scores derived from clustered model responses. Evaluated on Llama-3.2-3B-Instruct using 9,697 harmful and 7,000 benign prompts, our concatenated multi-layer MLP probes achieve ROC AUC of 0.989 and 96.7% accuracy with 100$\times$ less computation than multi-sampling defenses.

APA

Yigit, A.D., Zandsalimy, M. & Sushmita, S. (2026). Real-Time Jailbreak Detection via Safety-Weighted Semantic Entropy Probes. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:1092-1099. Available from https://proceedings.mlr.press/v318/yigit26a.html.
