Large Vocabulary Read-Mode Speech Corpora for Low-Resourced Ometo Languages: Gamo, Gofa, Dawuro and Wolaita

Nebiyu Simon, Micheal Melese, Akililu Elias
DLI 2025 Research Track, PMLR 302:1-10, 2026.

Abstract

Speech is a fundamental mode of human communication and has become a popular way for people to interact with machines through speech technology. Automatic Speech Recognition (ASR) is a speech technology that transcribes speech into its corresponding text and facilitates communication between humans and electronic devices. Building an ASR system requires a large speech dataset paired with its transcriptions. However, collecting and pre-processing such corpora is very expensive for many languages, including the Ometo languages. This is mainly because they are classified as low-resource languages, lacking sufficient linguistic data and technological resources. To address this data scarcity for four Ometo languages (Gamo, Gofa, Dawuro, and Wolaita), we have developed large speech corpora of 24.3511 hours with corresponding transcriptions. To verify the usability of the corpora, we then developed an ASR system for each language using a deep learning technique, achieving word error rates (WER) of 72.00%, 57.94%, 62.22%, and 64.71% for Gamo, Gofa, Dawuro, and Wolaita, respectively. We present the corpora and these baseline ASR systems to demonstrate that the corpora are suitable for further research on ASR. Researchers can therefore use the corpora to improve speech processing systems.

Keywords: Automatic Speech Recognition, Low-Resource Languages, Ometo Languages, Deep Learning.
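For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that standard definition only; the abstract does not say which scoring tool the authors used, and the function name `word_error_rate` and the placeholder strings are assumptions for illustration, not data from the corpora.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: whitespace tokenization is an assumption, and real
    scoring pipelines usually normalize text (case, punctuation) first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder strings: one substitution and one deletion over four words.
    print(f"{word_error_rate('a b c d', 'a x c'):.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.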

Cite this Paper


BibTeX
@InProceedings{pmlr-v302-simon26a,
  title = {Large Vocabulary Read-Mode Speech Corpora for Low-Resourced Ometo Languages: Gamo, Gofa, Dawuro and Wolaita},
  author = {Simon, Nebiyu and Melese, Micheal and Elias, Akililu},
  booktitle = {DLI 2025 Research Track},
  pages = {1--10},
  year = {2026},
  editor = {Haddad, Hatem and Kahira, Albert Njoroge and Bourhim, Sofia and Olatunji, Iyiola Emmanuel and Makhafola, Lesego and Mwase, Christine},
  volume = {302},
  series = {Proceedings of Machine Learning Research},
  month = {17--22 Aug},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v302/main/assets/simon26a/simon26a.pdf},
  url = {https://proceedings.mlr.press/v302/simon26a.html},
  abstract = {Speech is a fundamental mode of human communication and has also become a popular way for people to interact with machines through the use of speech technology. Automatic Speech Recognition (ASR) is one of speech technologies which transcribes speech to its corresponding text using numerous techniques and facilitates communication between human and electronic devices. To make it a real large amount of speech dataset parallel with its transcription is necessary. However, developing a large amount of corpora through collection and pre-processing is very expensive for many languages, including Ometo languages. This is mainly because they are classified as low-resource languages, lacking sufficient linguistic data and technological resources. In order to solve the problem of data scarcity for Ometo languages: Gamo, Gofa, Dawuro, and Wolaita, we have developed large speech corpora of 24.3511 hr with its corresponding transcription for four Ometo languages. Then, we have developed ASR systems for each language to verify the usability of the corpora using a deep learning technique. We have achieved WER of 72.00%, 57.94%, 62.22%, 64.71% for Gamo, Gofa, Dawuro and Wolaita languages, respectively. In order to demonstrate that the corpora are appropriate for additional research toward the creation of ASR systems, we present the corpora and the baseline ASR systems we have constructed in this study. The corpora can therefore be used by researchers to improve speech processing systems. Keywords: Automatic Speech Recognition, Low-Resource Languages, Ometo Languages, Deep Learning.}
}
Endnote
%0 Conference Paper
%T Large Vocabulary Read-Mode Speech Corpora for Low-Resourced Ometo Languages: Gamo, Gofa, Dawuro and Wolaita
%A Nebiyu Simon
%A Micheal Melese
%A Akililu Elias
%B DLI 2025 Research Track
%C Proceedings of Machine Learning Research
%D 2026
%E Hatem Haddad
%E Albert Njoroge Kahira
%E Sofia Bourhim
%E Iyiola Emmanuel Olatunji
%E Lesego Makhafola
%E Christine Mwase
%F pmlr-v302-simon26a
%I PMLR
%P 1--10
%U https://proceedings.mlr.press/v302/simon26a.html
%V 302
%X Speech is a fundamental mode of human communication and has also become a popular way for people to interact with machines through the use of speech technology. Automatic Speech Recognition (ASR) is one of speech technologies which transcribes speech to its corresponding text using numerous techniques and facilitates communication between human and electronic devices. To make it a real large amount of speech dataset parallel with its transcription is necessary. However, developing a large amount of corpora through collection and pre-processing is very expensive for many languages, including Ometo languages. This is mainly because they are classified as low-resource languages, lacking sufficient linguistic data and technological resources. In order to solve the problem of data scarcity for Ometo languages: Gamo, Gofa, Dawuro, and Wolaita, we have developed large speech corpora of 24.3511 hr with its corresponding transcription for four Ometo languages. Then, we have developed ASR systems for each language to verify the usability of the corpora using a deep learning technique. We have achieved WER of 72.00%, 57.94%, 62.22%, 64.71% for Gamo, Gofa, Dawuro and Wolaita languages, respectively. In order to demonstrate that the corpora are appropriate for additional research toward the creation of ASR systems, we present the corpora and the baseline ASR systems we have constructed in this study. The corpora can therefore be used by researchers to improve speech processing systems. Keywords: Automatic Speech Recognition, Low-Resource Languages, Ometo Languages, Deep Learning.
APA
Simon, N., Melese, M. & Elias, A. (2026). Large Vocabulary Read-Mode Speech Corpora for Low-Resourced Ometo Languages: Gamo, Gofa, Dawuro and Wolaita. DLI 2025 Research Track, in Proceedings of Machine Learning Research 302:1-10. Available from https://proceedings.mlr.press/v302/simon26a.html.
