Robust Tokenization for Low-Resource Oromo Medical Texts via Novel Lightweight Augmentation

Karthik Srikumar
Proceedings of the AI for African Languages Conference 2025, PMLR 314:50-55, 2026.

Abstract

Afaan Oromo presents challenges for natural language processing due to complex morphology and inconsistent spelling, particularly in medical texts. This paper proposes a rule-based data augmentation method that generates synthetic sentence variants using Oromo-specific linguistic rules. Applied to a set of 500 medical sentences, the approach produces 1,500 augmented samples and reduces tokenization errors by 50 percent. Improvements are also observed in token fertility and vocabulary coverage, supporting more effective medical NLP applications in low-resource settings.
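As a rough illustration of the kind of pipeline the abstract describes, the sketch below applies a few spelling-variation rules to a sentence and measures token fertility (average tokens per word). Everything in it is an assumption made for illustration: the abstract does not list the paper's rules, so `shorten_long_vowels`, `degeminate`, and `drop_apostrophe` are hypothetical stand-ins based on common Oromo spelling variation (vowel length, consonant doubling, the apostrophe), `chunk3` is a toy tokenizer rather than the one evaluated, and the input string is a toy example, not a sentence from the paper's dataset.

```python
import re
from typing import Callable, List

# Illustrative rules only: the abstract does not specify the paper's actual rules.

def shorten_long_vowels(text: str) -> str:
    """Collapse a doubled lowercase vowel ("aa" -> "a"), dropping length marking."""
    return re.sub(r"([aeiou])\1", r"\1", text)

def degeminate(text: str) -> str:
    """Collapse a doubled lowercase consonant ("kk" -> "k")."""
    return re.sub(r"([b-df-hj-np-tv-z])\1", r"\1", text)

def drop_apostrophe(text: str) -> str:
    """Remove straight and curly apostrophes ("ta'e" -> "tae")."""
    return text.replace("'", "").replace("\u2019", "")

RULES: List[Callable[[str], str]] = [shorten_long_vowels, degeminate, drop_apostrophe]

def augment(sentence: str) -> List[str]:
    """Apply each rule on its own; keep variants that differ from the input."""
    variants: List[str] = []
    for rule in RULES:
        variant = rule(sentence)
        if variant != sentence and variant not in variants:
            variants.append(variant)
    return variants

def fertility(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per word (lower is better)."""
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)

def chunk3(word: str) -> List[str]:
    """Toy tokenizer: split a word into pieces of at most three characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

if __name__ == "__main__":
    toy = "Afaan Oromoo akkam ta'e"  # toy input, not from the paper's dataset
    for variant in augment(toy):
        print(variant)
    print(fertility(toy.split(), chunk3))
```

Applying each rule independently yields at most one variant per rule per sentence. With a real subword tokenizer in place of `chunk3`, the same `fertility` function could be used to compare a tokenizer before and after training on the augmented data.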

Cite this Paper


BibTeX
@InProceedings{pmlr-v314-srikumar26a,
  title = {Robust Tokenization for Low-Resource Oromo Medical Texts via Novel Lightweight Augmentation},
  author = {Srikumar, Karthik},
  booktitle = {Proceedings of the AI for African Languages Conference 2025},
  pages = {50--55},
  year = {2026},
  editor = {Bainomugisha, Engineer and Mwebaze, Ernest and Kimera, Richard and Nabende, Joyce Nakatumba and Katumba, Andrew and Quinn, John},
  volume = {314},
  series = {Proceedings of Machine Learning Research},
  month = {10 Oct},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v314/main/assets/srikumar26a/srikumar26a.pdf},
  url = {https://proceedings.mlr.press/v314/srikumar26a.html},
  abstract = {Afaan Oromo presents challenges for natural language processing due to complex morphology and inconsistent spelling, particularly in medical texts. This paper proposes a rule-based data augmentation method that generates synthetic sentence variants using Oromo-specific linguistic rules. Applied to a set of 500 medical sentences, the approach produces 1,500 augmented samples and reduces tokenization errors by 50 percent. Improvements are also observed in token fertility and vocabulary coverage, supporting more effective medical NLP applications in low-resource settings.}
}
Endnote
%0 Conference Paper
%T Robust Tokenization for Low-Resource Oromo Medical Texts via Novel Lightweight Augmentation
%A Karthik Srikumar
%B Proceedings of the AI for African Languages Conference 2025
%C Proceedings of Machine Learning Research
%D 2026
%E Engineer Bainomugisha
%E Ernest Mwebaze
%E Richard Kimera
%E Joyce Nakatumba Nabende
%E Andrew Katumba
%E John Quinn
%F pmlr-v314-srikumar26a
%I PMLR
%P 50--55
%U https://proceedings.mlr.press/v314/srikumar26a.html
%V 314
%X Afaan Oromo presents challenges for natural language processing due to complex morphology and inconsistent spelling, particularly in medical texts. This paper proposes a rule-based data augmentation method that generates synthetic sentence variants using Oromo-specific linguistic rules. Applied to a set of 500 medical sentences, the approach produces 1,500 augmented samples and reduces tokenization errors by 50 percent. Improvements are also observed in token fertility and vocabulary coverage, supporting more effective medical NLP applications in low-resource settings.
APA
Srikumar, K. (2026). Robust Tokenization for Low-Resource Oromo Medical Texts via Novel Lightweight Augmentation. Proceedings of the AI for African Languages Conference 2025, in Proceedings of Machine Learning Research 314:50-55. Available from https://proceedings.mlr.press/v314/srikumar26a.html.
