Towards Language Representation for SiSwati: A Comparative Analysis of Sub-word Tokenization Algorithms
DLI 2025 Research Track, PMLR 302:1-10, 2026.
Abstract
Many African languages, including SiSwati, are underrepresented in current AI systems, partly due to imprecise language representation. This study investigates sub-word tokenization algorithms for building monolingual SiSwati tokenizers, a critical step towards improving the language's representation. We implement and compare the Byte-Pair Encoding (BPE), Unigram Language Model (ULM), and WordPiece algorithms, evaluating each with three distinct vocabulary sizes: 32K, 50K, and 70K. The tokenizers' outputs are assessed on a downstream sentiment analysis task using multiple classifiers. The results demonstrate that sub-word representation is effective for SiSwati and that monolingual tokenizers can achieve morphologically aware sub-word segmentation. Notably, Unigram with a 32K vocabulary paired with an XGBoost classifier yielded the highest F1-score, while BPE and WordPiece offered more stable performance across vocabulary sizes, with a 32K vocabulary often proving sufficient for these two algorithms. These findings highlight the strong interplay between tokenization algorithm, vocabulary size, and classifier choice in developing tools for low-resource, morphologically rich languages.

Keywords: Tokenization, low-resource, vocabulary size, morphologically-rich language.
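For context, the sketch below shows one way the three tokenizer families compared in the paper could be trained at one of the studied vocabulary sizes, using the Hugging Face tokenizers library. This is an illustrative assumption, not the authors' pipeline: the corpus path siswati_corpus.txt, the special tokens, and the choice of library are all hypothetical.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical inputs: corpus file and special tokens are illustrative only.
CORPUS = ["siswati_corpus.txt"]
VOCAB_SIZE = 32_000  # one of the three sizes studied: 32K, 50K, 70K

def train_bpe():
    # BPE: iteratively merges the most frequent symbol pairs.
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"])
    tok.train(CORPUS, trainer)
    return tok

def train_unigram():
    # Unigram LM: starts from a large seed vocabulary and prunes
    # the sub-words whose removal least hurts corpus likelihood.
    tok = Tokenizer(models.Unigram())
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.UnigramTrainer(
        vocab_size=VOCAB_SIZE, unk_token="[UNK]", special_tokens=["[UNK]"]
    )
    tok.train(CORPUS, trainer)
    return tok

def train_wordpiece():
    # WordPiece: greedy merges scored by likelihood gain; continuation
    # pieces carry the "##" prefix.
    tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.WordPieceTrainer(vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"])
    tok.train(CORPUS, trainer)
    return tok

if __name__ == "__main__":
    tok = train_bpe()
    # Inspect the sub-word segmentation of a SiSwati phrase.
    print(tok.encode("Sawubona mhlaba").tokens)
```

The resulting sub-word sequences could then be fed, via any vectorization of choice, to downstream classifiers such as XGBoost for the sentiment analysis evaluation described in the abstract.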