Real-Time Jailbreak Detection via Safety-Weighted Semantic Entropy Probes
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:1092-1099, 2026.
Abstract
Large language models remain vulnerable to jailbreak attacks that bypass safety alignment. Existing defenses often require multi-pass generation or gradient analysis, which limits real-time deployment. We introduce Safety-Weighted Semantic Entropy (SWSE) Probes, a lightweight method that detects jailbreak attempts at the token-before-generation position, before any response token is produced, using neural probes on model hidden states. Inspired by semantic entropy approaches to hallucination detection, our method estimates jailbreak likelihood from a single forward pass by training probes on safety-aware entropy scores derived from clustered model responses. Evaluated on Llama-3.2-3B-Instruct with 9,697 harmful and 7,000 benign prompts, our concatenated multi-layer MLP probes achieve a ROC AUC of 0.989 and 96.7% accuracy with 100$\times$ less computation than multi-sampling defenses.
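The abstract does not give the exact SWSE formula, so the following is only a minimal sketch of one plausible reading: sampled responses are grouped into semantic clusters, the entropy of the cluster distribution is computed, and each cluster's contribution is scaled by a safety weight. The function names, the `harm_weight` mapping, and the weighting scheme itself are assumptions made for illustration, not the authors' definition. In the method described above, scores of this kind would serve as training targets for probes on hidden states, so that inference needs only a single forward pass.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids: one cluster label per sampled response.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def safety_weighted_entropy(cluster_ids, harm_weight):
    """Hypothetical safety weighting (an assumption, not the paper's formula).

    Each cluster's entropy term is scaled by a weight in [0, 1] that reflects
    how harmful the responses in that cluster are judged to be. Clusters
    missing from harm_weight are treated as having weight 0.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum(
        harm_weight.get(k, 0.0) * (c / n) * math.log(c / n)
        for k, c in counts.items()
    )


# Illustrative input: four sampled responses falling into two clusters.
clusters = ["refusal", "refusal", "compliance", "compliance"]
weights = {"refusal": 0.0, "compliance": 1.0}  # made-up weights
score = safety_weighted_entropy(clusters, weights)
```

With all weights set to 1 the weighted score reduces to ordinary semantic entropy, which makes the weighting easy to sanity-check against the unweighted baseline.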