Large Vocabulary Read-Mode Speech Corpora for Low-Resourced Ometo Languages: Gamo, Gofa, Dawuro and Wolaita
DLI 2025 Research Track, PMLR 302:1-10, 2026.
Abstract
Speech is a fundamental mode of human communication and has also become a popular way for people to interact with machines through speech technology. Automatic Speech Recognition (ASR) is a speech technology that transcribes speech into its corresponding text and facilitates communication between humans and electronic devices. Building an ASR system requires a large speech dataset paired with its transcriptions. However, developing large corpora through collection and pre-processing is very expensive for many languages, including the Ometo languages. This is mainly because they are low-resource languages, lacking sufficient linguistic data and technological resources. To address the data scarcity for the Ometo languages Gamo, Gofa, Dawuro, and Wolaita, we have developed speech corpora of 24.3511 hours with corresponding transcriptions for these four languages. We then developed an ASR system for each language using a deep learning technique to verify the usability of the corpora. We achieved word error rates (WER) of 72.00%, 57.94%, 62.22%, and 64.71% for Gamo, Gofa, Dawuro, and Wolaita, respectively. We present the corpora and the baseline ASR systems built in this study to demonstrate that the corpora are suitable for further research on ASR systems. The corpora can therefore be used by researchers to improve speech processing systems.
Keywords: Automatic Speech Recognition, Low-Resource Languages, Ometo Languages, Deep Learning.
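The baseline results above are reported as word error rate (WER). As a reading aid, the following is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the evaluation code used in the study. The function name `word_error_rate` and the placeholder tokens are hypothetical, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no text normalization
    (casing, punctuation) is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real corpus data: one substitution and one
# deletion against a four-word reference.
wer = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
print(f"WER = {wer:.2%}")
```

Under this definition, a WER of 57.94% corresponds to roughly 58 word-level errors per 100 reference words. WER can exceed 100% when the hypothesis contains many insertions.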