Sauti Halisi: Towards Direct Speech-to-Text Translation for Colloquial and Code-Switched Swahili

Gill O’Brian
Proceedings of the AI for African Languages Conference 2025, PMLR 314:21-26, 2026.

Abstract

Standard Swahili forms the basis of most existing language technologies, yet everyday communication across East Africa relies heavily on colloquial and code-switched varieties such as Sheng and Swahili-English. This mismatch leads to large performance gaps in speech recognition and translation systems, which are further amplified by cascaded ASR and machine translation pipelines. This paper introduces the Sauti Halisi project, which fine-tunes a multilingual, multimodal foundation model for direct speech-to-text translation from colloquial Swahili to English. By bypassing intermediate transcription, the system handles informal speech, slang, and code-switching more robustly than cascaded baselines, representing a step toward more inclusive language technologies.

Cite this Paper


BibTeX
@InProceedings{pmlr-v314-o-brian26a,
  title     = {Sauti Halisi: Towards Direct Speech-to-Text Translation for Colloquial and Code-Switched Swahili},
  author    = {O'Brian, Gill},
  booktitle = {Proceedings of the AI for African Languages Conference 2025},
  pages     = {21--26},
  year      = {2026},
  editor    = {Bainomugisha, Engineer and Mwebaze, Ernest and Kimera, Richard and Nabende, Joyce Nakatumba and Katumba, Andrew and Quinn, John},
  volume    = {314},
  series    = {Proceedings of Machine Learning Research},
  month     = {10 Oct},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v314/main/assets/o-brian26a/o-brian26a.pdf},
  url       = {https://proceedings.mlr.press/v314/o-brian26a.html},
  abstract  = {Standard Swahili forms the basis of most existing language technologies, yet everyday communication across East Africa relies heavily on colloquial and code-switched varieties such as Sheng and Swahili-English. This mismatch leads to large performance gaps in speech recognition and translation systems, which are further amplified by cascaded ASR and machine translation pipelines. This paper introduces the Sauti Halisi project, which fine-tunes a multilingual, multimodal foundation model for direct speech-to-text translation from colloquial Swahili to English. By bypassing intermediate transcription, the system handles informal speech, slang, and code-switching more robustly than cascaded baselines, representing a step toward more inclusive language technologies.}
}
Endnote
%0 Conference Paper
%T Sauti Halisi: Towards Direct Speech-to-Text Translation for Colloquial and Code-Switched Swahili
%A Gill O’Brian
%B Proceedings of the AI for African Languages Conference 2025
%C Proceedings of Machine Learning Research
%D 2026
%E Engineer Bainomugisha
%E Ernest Mwebaze
%E Richard Kimera
%E Joyce Nakatumba Nabende
%E Andrew Katumba
%E John Quinn
%F pmlr-v314-o-brian26a
%I PMLR
%P 21--26
%U https://proceedings.mlr.press/v314/o-brian26a.html
%V 314
%X Standard Swahili forms the basis of most existing language technologies, yet everyday communication across East Africa relies heavily on colloquial and code-switched varieties such as Sheng and Swahili-English. This mismatch leads to large performance gaps in speech recognition and translation systems, which are further amplified by cascaded ASR and machine translation pipelines. This paper introduces the Sauti Halisi project, which fine-tunes a multilingual, multimodal foundation model for direct speech-to-text translation from colloquial Swahili to English. By bypassing intermediate transcription, the system handles informal speech, slang, and code-switching more robustly than cascaded baselines, representing a step toward more inclusive language technologies.
APA
O’Brian, G. (2026). Sauti Halisi: Towards Direct Speech-to-Text Translation for Colloquial and Code-Switched Swahili. Proceedings of the AI for African Languages Conference 2025, in Proceedings of Machine Learning Research 314:21-26. Available from https://proceedings.mlr.press/v314/o-brian26a.html.
