Towards Language Representation for SiSwati: A Comparative Analysis of Sub-word Tokenization Algorithms

Msane Thandokuhle, Haddad Hatem
DLI 2025 Research Track, PMLR 302:1-10, 2026.

Abstract

Many African languages, including SiSwati, are underrepresented in current AI interactions due to challenges such as imprecise language representation. This study investigates various sub-word tokenization algorithms for building monolingual SiSwati tokenizers, a critical step towards enhancing its linguistic representation. We implement and compare Byte-Pair Encoding (BPE), Unigram Language Model (ULM), and WordPiece algorithms, evaluating their performance with three distinct vocabulary sizes: 32K, 50K, and 70K. The tokenizers’ outputs were assessed on a downstream sentiment analysis task using multiple classifiers. The results demonstrate that sub-word representation is effective for SiSwati and that monolingual tokenizers can achieve morphologically-aware sub-word segmentation. Notably, Unigram with a 32K vocabulary paired with an XGBoost classifier yielded the highest peak F1-score, though BPE and WordPiece also offered more stable performance across different vocabulary capacities, with 32K vocabularies often proving sufficient for these two. These findings highlight the significant interplay between tokenizer algorithm type, vocabulary size, and classifier choice in developing tools for low-resource, morphologically rich languages.

Keywords: Tokenization, low-resource, vocabulary size, morphologically-rich language.
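To make the comparison concrete, the merge loop at the heart of Byte-Pair Encoding (one of the three algorithms compared in the abstract) can be sketched in a few lines of plain Python. This is a minimal illustrative sketch, not the authors' implementation; the function names (`train_bpe`, `get_pair_counts`, `merge_pair`), the end-of-word marker `</w>`, and the toy corpus are all assumptions for illustration. In practice the paper's tokenizers would be trained with a dedicated library at vocabulary sizes of 32K–70K.

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols to its corpus frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the adjacent symbol pair with one merged symbol.
    merged_words = {}
    new_symbol = pair[0] + pair[1]
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged_words[key] = merged_words.get(key, 0) + freq
    return merged_words

def train_bpe(corpus, num_merges):
    # Start from individual characters; mark word boundaries with "</w>".
    words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedily merge the most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges
```

For example, `train_bpe("lala lala bala", 2)` first merges `("l", "a")`, since that pair is the most frequent in the toy corpus. Unigram (ULM) works in the opposite direction, pruning a large initial vocabulary by likelihood, and WordPiece scores merges by a likelihood gain rather than raw frequency; the greedy-merge skeleton above is specific to BPE.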

Cite this Paper


BibTeX
@InProceedings{pmlr-v302-thandokuhle26a,
  title     = {Towards Language Representation for SiSwati: A Comparative Analysis of Sub-word Tokenization Algorithms},
  author    = {Thandokuhle, Msane and Hatem, Haddad},
  booktitle = {DLI 2025 Research Track},
  pages     = {1--10},
  year      = {2026},
  editor    = {Haddad, Hatem and Kahira, Albert Njoroge and Bourhim, Sofia and Olatunji, Iyiola Emmanuel and Makhafola, Lesego and Mwase, Christine},
  volume    = {302},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--22 Aug},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v302/main/assets/thandokuhle26a/thandokuhle26a.pdf},
  url       = {https://proceedings.mlr.press/v302/thandokuhle26a.html},
  abstract  = {Many African languages, including SiSwati, are underrepresented in current AI interactions due to challenges such as imprecise language representation. This study investigates various sub-word tokenization algorithms for building monolingual SiSwati tokenizers, a critical step towards enhancing its linguistic representation. We implement and compare Byte-Pair Encoding (BPE), Unigram Language Model (ULM), and WordPiece algorithms, evaluating their performance with three distinct vocabulary sizes: 32K, 50K, and 70K. The tokenizers’ outputs were assessed on a downstream sentiment analysis task using multiple classifiers. The results demonstrate that sub-word representation is effective for SiSwati and that monolingual tokenizers can achieve morphologically-aware sub-word segmentation. Notably, Unigram with a 32K vocabulary paired with an XGBoost classifier yielded the highest peak F1-score, though BPE and WordPiece also offered more stable performance across different vocabulary capacities, with 32K vocabularies often proving sufficient for these two. These findings highlight the significant interplay between tokenizer algorithm type, vocabulary size, and classifier choice in developing tools for low-resource, morphologically rich languages. Keywords: Tokenization, low-resource, vocabulary size, morphologically-rich language.}
}
Endnote
%0 Conference Paper
%T Towards Language Representation for SiSwati: A Comparative Analysis of Sub-word Tokenization Algorithms
%A Msane Thandokuhle
%A Haddad Hatem
%B DLI 2025 Research Track
%C Proceedings of Machine Learning Research
%D 2026
%E Hatem Haddad
%E Albert Njoroge Kahira
%E Sofia Bourhim
%E Iyiola Emmanuel Olatunji
%E Lesego Makhafola
%E Christine Mwase
%F pmlr-v302-thandokuhle26a
%I PMLR
%P 1--10
%U https://proceedings.mlr.press/v302/thandokuhle26a.html
%V 302
%X Many African languages, including SiSwati, are underrepresented in current AI interactions due to challenges such as imprecise language representation. This study investigates various sub-word tokenization algorithms for building monolingual SiSwati tokenizers, a critical step towards enhancing its linguistic representation. We implement and compare Byte-Pair Encoding (BPE), Unigram Language Model (ULM), and WordPiece algorithms, evaluating their performance with three distinct vocabulary sizes: 32K, 50K, and 70K. The tokenizers’ outputs were assessed on a downstream sentiment analysis task using multiple classifiers. The results demonstrate that sub-word representation is effective for SiSwati and that monolingual tokenizers can achieve morphologically-aware sub-word segmentation. Notably, Unigram with a 32K vocabulary paired with an XGBoost classifier yielded the highest peak F1-score, though BPE and WordPiece also offered more stable performance across different vocabulary capacities, with 32K vocabularies often proving sufficient for these two. These findings highlight the significant interplay between tokenizer algorithm type, vocabulary size, and classifier choice in developing tools for low-resource, morphologically rich languages. Keywords: Tokenization, low-resource, vocabulary size, morphologically-rich language.
APA
Thandokuhle, M. & Hatem, H. (2026). Towards Language Representation for SiSwati: A Comparative Analysis of Sub-word Tokenization Algorithms. DLI 2025 Research Track, in Proceedings of Machine Learning Research 302:1-10. Available from https://proceedings.mlr.press/v302/thandokuhle26a.html.