Grounding Terms from an Ontology for use in Autoformalization: Tokenization is All You Need

Richard Thompson, Adam Pease, Mathias Kölsch, Angelos Toutsios
Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning, PMLR 284:130-136, 2025.

Abstract

Large Language Models (LLMs) have shown strong performance in translating natural language into programming languages like Python or Java. However, for niche computer languages, where there is limited training data, fine-tuning a base model is often necessary. A key challenge arises when the pretrained embeddings of natural language terms interfere with the intended syntax and semantics of formal language terms. This issue is especially pronounced in the logical language of SUO-KIF, which is used in the Suggested Upper Merged Ontology (SUMO). SUMO contains thousands of terms that closely resemble everyday English words. As a result, models often produce syntactic errors or hallucinate non-existent terms due to conflicting embeddings learned during base training. This work introduces a tokenization-based technique to mitigate these issues. By altering how formal terms are tokenized, we can decouple their embeddings from similar natural language words, significantly reducing syntax errors and term hallucinations in the generated formal language output.
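To make the idea concrete, the following is a minimal sketch of one way the described decoupling could be done with the Hugging Face transformers library. It is an illustration under stated assumptions, not the paper's exact recipe: the base model ("gpt2"), the "<sumo:...>" marker format, the sample SUMO terms, and the mark_formula helper are all hypothetical choices made for the example.

# Minimal sketch (assumption): rewrite each SUMO term with a reserved marker and
# register the marked form as a single new token, so the model learns a fresh
# embedding for it instead of reusing subword embeddings of the look-alike
# English word. The paper's actual tokenization scheme may differ.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "gpt2"  # hypothetical placeholder; not the model used in the paper
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical sample of SUMO terms that resemble everyday English words.
sumo_terms = ["Organization", "Process", "Human", "instance", "subclass"]

# Add the marked forms as whole tokens; previously a term such as "Organization"
# would be split into generic English subwords and inherit their embeddings.
marked_terms = [f"<sumo:{term}>" for term in sumo_terms]
tokenizer.add_tokens(marked_terms)
model.resize_token_embeddings(len(tokenizer))  # new rows are freshly initialized

def mark_formula(formula: str) -> str:
    """Rewrite SUMO terms in a SUO-KIF formula into their marked forms.
    Naive substring replacement, for illustration only."""
    for term in sumo_terms:
        formula = formula.replace(term, f"<sumo:{term}>")
    return formula

print(tokenizer.tokenize(mark_formula("(instance Org1 Organization)")))
# Each marked term now tokenizes to a single dedicated token whose embedding is
# learned during fine-tuning, independent of the similar natural language word.

During fine-tuning the formal side of each training pair would be rewritten with mark_formula, and generated output mapped back by stripping the markers; any term the model emits that does not correspond to a registered token can then be flagged rather than silently accepted, which is one way the reduction in hallucinated terms could be checked.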

Cite this Paper


BibTeX
@InProceedings{pmlr-v284-thompson25a,
  title     = {Grounding Terms from an Ontology for use in Autoformalization: Tokenization is All You Need},
  author    = {Thompson, Richard and Pease, Adam and K\"{o}lsch, Mathias and Toutsios, Angelos},
  booktitle = {Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning},
  pages     = {130--136},
  year      = {2025},
  editor    = {H. Gilpin, Leilani and Giunchiglia, Eleonora and Hitzler, Pascal and van Krieken, Emile},
  volume    = {284},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v284/main/assets/thompson25a/thompson25a.pdf},
  url       = {https://proceedings.mlr.press/v284/thompson25a.html}
}
Endnote
%0 Conference Paper
%T Grounding Terms from an Ontology for use in Autoformalization: Tokenization is All You Need
%A Richard Thompson
%A Adam Pease
%A Mathias Kölsch
%A Angelos Toutsios
%B Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning
%C Proceedings of Machine Learning Research
%D 2025
%E Leilani H. Gilpin
%E Eleonora Giunchiglia
%E Pascal Hitzler
%E Emile van Krieken
%F pmlr-v284-thompson25a
%I PMLR
%P 130--136
%U https://proceedings.mlr.press/v284/thompson25a.html
%V 284
APA
Thompson, R., Pease, A., Kölsch, M., & Toutsios, A. (2025). Grounding Terms from an Ontology for use in Autoformalization: Tokenization is All You Need. Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning, in Proceedings of Machine Learning Research 284:130-136. Available from https://proceedings.mlr.press/v284/thompson25a.html.