Sauti Halisi: Towards Direct Speech-to-Text Translation for Colloquial and Code-Switched Swahili
Proceedings of the AI for African Languages Conference 2025, PMLR 314:21-26, 2026.
Abstract
Standard Swahili forms the basis of most existing language technologies, yet everyday communication across East Africa relies heavily on colloquial and code-switched varieties such as Sheng and Swahili-English code-switching. This mismatch leads to large performance gaps in speech recognition and translation systems, and cascaded pipelines that chain automatic speech recognition (ASR) with machine translation amplify those gaps further. This paper introduces the Sauti Halisi project, which fine-tunes a multilingual, multimodal foundation model for direct speech-to-text translation from colloquial Swahili to English. By bypassing intermediate transcription, the system handles informal speech, slang, and code-switching more robustly than cascaded baselines, representing a step toward more inclusive language technologies.