Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:662-677, 2026.

Abstract

Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems, using formalisms such as set theory, formal logic, and quantum mechanics, bypasses these filters at high rates, achieving 46–56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-zhang26a,
  title = {Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis},
  author = {Zhang, Haoyu and Zandsalimy, Mohammad and Sushmita, Shanu},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {662--677},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/zhang26a/zhang26a.pdf},
  url = {https://proceedings.mlr.press/v318/zhang26a.html},
  abstract = {Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems, using formalisms such as set theory, formal logic, and quantum mechanics, bypasses these filters at high rates, achieving 46–56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.}
}
Endnote
%0 Conference Paper
%T Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
%A Haoyu Zhang
%A Mohammad Zandsalimy
%A Shanu Sushmita
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-zhang26a
%I PMLR
%P 662--677
%U https://proceedings.mlr.press/v318/zhang26a.html
%V 318
%X Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems, using formalisms such as set theory, formal logic, and quantum mechanics, bypasses these filters at high rates, achieving 46–56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.
APA
Zhang, H., Zandsalimy, M. & Sushmita, S. (2026). Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:662-677. Available from https://proceedings.mlr.press/v318/zhang26a.html.
