Grounding Terms from an Ontology for use in Autoformalization: Tokenization is All You Need
Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning, PMLR 284:130-136, 2025.
Abstract
Large Language Models (LLMs) have shown strong performance in translating natural language into programming languages such as Python or Java. For niche computer languages with limited training data, however, fine-tuning a base model is often necessary. A key challenge arises when the pretrained embeddings of natural language terms interfere with the intended syntax and semantics of formal language terms. This issue is especially pronounced in SUO-KIF, the logical language used in the Suggested Upper Merged Ontology (SUMO). SUMO contains thousands of terms that closely resemble everyday English words. As a result, models often produce syntactic errors or hallucinate non-existent terms due to conflicting embeddings learned during base training. This work introduces a tokenization-based technique to mitigate these issues. By altering how formal terms are tokenized, we can decouple their embeddings from those of similar natural language words, significantly reducing syntax errors and term hallucinations in the generated formal language output.
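
The sketch below illustrates one plausible realization of such a tokenization change, assuming a Hugging Face tokenizer and causal language model; the base model, the "sumo_" prefix, the helper mark_terms, and the term list are illustrative assumptions rather than the paper's actual implementation. Rewriting each ontology term with a reserved prefix and registering the prefixed form as a single new token gives it a freshly initialized embedding that no longer shares weights with the English word it resembles.

# Minimal sketch (not the paper's code): decouple SUMO term embeddings from
# similar English words by giving each term its own dedicated token.
# Base model, "sumo_" prefix, and term list are illustrative assumptions.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# SUMO terms that closely resemble everyday English words (illustrative subset).
sumo_terms = ["instance", "subclass", "Human", "Organization", "agent"]

def mark_terms(kif: str) -> str:
    # Rewrite each SUMO term in the fine-tuning corpus with a reserved prefix
    # so it no longer collides with pretrained English (sub)word tokens.
    for term in sumo_terms:
        kif = re.sub(rf"\b{re.escape(term)}\b", f"sumo_{term}", kif)
    return kif

# Register the prefixed forms as single new tokens; each gets a new row in the
# embedding matrix that is trained only on formal-language data.
tokenizer.add_tokens([f"sumo_{t}" for t in sumo_terms])
model.resize_token_embeddings(len(tokenizer))

formula = mark_terms("(instance ?H Human)")
print(tokenizer.tokenize(formula))  # each SUMO term now appears as one token

A prefix is used here only to guarantee that every ontology term receives its own vocabulary entry; whatever marking scheme is chosen, the point is that the formal term's embedding is learned during fine-tuning rather than inherited from the similar-looking English word.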