Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:662-677, 2026.
Abstract
Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems, using formalisms such as set theory, formal logic, and quantum mechanics, bypasses these filters at high rates, achieving 46–56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.