Robust Tokenization for Low-Resource Oromo Medical Texts via Novel Lightweight Augmentation
Proceedings of the AI for African Languages Conference 2025, PMLR 314:50-55, 2026.
Abstract
Afaan Oromo presents challenges for natural language processing due to complex morphology and inconsistent spelling, particularly in medical texts. This paper proposes a rule-based data augmentation method that generates synthetic sentence variants using Oromo-specific linguistic rules. Applied to a set of 500 medical sentences, the approach produces 1,500 augmented samples and reduces tokenization errors by 50 percent. Improvements are also observed in token fertility and vocabulary coverage, supporting more effective medical NLP applications in low-resource settings.
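The abstract does not list the augmentation rules or define its metrics, so the following is only a minimal sketch of how rule-based spelling-variant generation and a token fertility measure might look. The two rules (shortening a doubled vowel, reducing a doubled consonant), the names `RULES`, `augment`, and `fertility`, and the example string are all assumptions for illustration, not the paper's rule set. Fertility is taken here to be the average number of tokens per whitespace-separated word, which is a common definition but may differ from the one used in the paper.

```python
import re

# Hypothetical spelling-variation rules (illustrative only, NOT the paper's rules).
# Each entry is (regex pattern, replacement).
RULES = [
    (r"([aeiou])\1", r"\1"),             # shorten the first doubled (long) vowel
    (r"([bcdfgjklmnqrst])\1", r"\1"),    # reduce the first doubled consonant
]


def augment(sentence, rules=RULES):
    """Return distinct variants of `sentence`, applying each rule once."""
    variants = []
    for pattern, repl in rules:
        variant = re.sub(pattern, repl, sentence, count=1)
        if variant != sentence and variant not in variants:
            variants.append(variant)
    return variants


def fertility(sentences, tokenize):
    """Average tokens per whitespace-separated word (assumed definition)."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    return n_tokens / n_words if n_words else 0.0


if __name__ == "__main__":
    source = "dhukkuba sukkaaraa"  # illustrative input string
    corpus = [source] + augment(source)
    # str.split stands in for a real subword tokenizer here.
    print(corpus, fertility(corpus, str.split))
```

Passing the tokenizer in as a function keeps the sketch independent of any particular library; a trained subword tokenizer's encode method could be substituted for `str.split` to compare fertility before and after training on augmented data.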