Ẹhugbo Ka! Advancing Machine Translation for the Low-Resource Ẹhugbo Language through Parallel Corpus Development

Ukachi Eze-Mbey, Uloma Eze-Mbey, Ololade Anjuwon
DLI 2025 Research Track, PMLR 302:1-9, 2026.

Abstract

Despite advances in language technologies, low-resource African languages and their dialects remain consistently excluded. One such dialect is Ẹhugbo, a critically endangered variant of Igbo spoken by fewer than 150,000 people in Afikpo, Nigeria. This exclusion perpetuates social and linguistic inequities, leaving speakers of such dialects without access to digital tools that could preserve their language and culture. This paper presents Ẹhugbo Ka! ("Greetings, Ẹhugbo!"), which addresses this gap. We built the only publicly available Ẹhugbo-English parallel corpus, 1,021 sentence pairs drawn from the New Testament of the Bible, and evaluated and fine-tuned two state-of-the-art models, M2M100 (facebook/m2m100_418M) and NLLB (facebook/nllb-200-distilled-600M). Initial results were stark: M2M100 achieved a BLEU score of 1.2188, while NLLB scored only 0.0262. After fine-tuning, M2M100 improved to 16.1719 and NLLB reached 20.4016, demonstrating the potential of adapting multilingual models to low-resource languages. Our findings reveal both promise and challenges: while fine-tuning significantly improves performance, the lack of diverse datasets limits translation quality and reinforces the need for inclusive data collection practices. This work highlights the importance of community-driven approaches, as linguistic preservation cannot be achieved without the active involvement of native speakers. This project not only advances the field of low-resource MT but also serves as a call to action for researchers and developers to prioritize linguistic diversity, ensuring that no language is left behind in the digital age.

Keywords: multilingual low resource, resources for less-resourced languages, minoritized languages, less resourced languages, endangered languages, indigenous languages, corpus creation, multilingual corpora, evaluation, datasets for low resource languages, Igbo, Igbo language.
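The abstract reports corpus-level BLEU scores before and after fine-tuning. As an illustration of what that metric measures, the sketch below is a minimal, self-contained corpus BLEU (uniform 4-gram precisions plus brevity penalty, single reference per segment) in plain Python. This is not the paper's evaluation pipeline: reported scores typically come from a library such as sacrebleu, which also handles tokenization and smoothing, and the example sentences here are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100): geometric mean of clipped
    n-gram precisions times the standard brevity penalty."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection clips each n-gram to its reference count
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # some precision is zero; unsmoothed BLEU collapses to 0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

For instance, an identical hypothesis and reference score 100.0, while a partially matching translation falls somewhere between 0 and 100, which is the scale on which the abstract's 0.0262-20.4016 range should be read.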

Cite this Paper


BibTeX
@InProceedings{pmlr-v302-eze-mbey26a,
  title = {Ẹhugbo Ka! Advancing Machine Translation for the Low-Resource Ẹhugbo Language through Parallel Corpus Development},
  author = {Eze-Mbey, Ukachi and Eze-Mbey, Uloma and Anjuwon, Ololade},
  booktitle = {DLI 2025 Research Track},
  pages = {1--9},
  year = {2026},
  editor = {Haddad, Hatem and Kahira, Albert Njoroge and Bourhim, Sofia and Olatunji, Iyiola Emmanuel and Makhafola, Lesego and Mwase, Christine},
  volume = {302},
  series = {Proceedings of Machine Learning Research},
  month = {17--22 Aug},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v302/main/assets/eze-mbey26a/eze-mbey26a.pdf},
  url = {https://proceedings.mlr.press/v302/eze-mbey26a.html},
  abstract = {Despite advances in language technologies, low-resource African languages and their dialects remain consistently excluded. One such dialect is Ẹhugbo, a critically endangered variant of Igbo spoken by fewer than 150,000 people in Afikpo, Nigeria. This exclusion perpetuates social and linguistic inequities, leaving speakers of such dialects without access to digital tools that could preserve their language and culture. This paper presents Ẹhugbo Ka! ("Greetings, Ẹhugbo!"), which addresses this gap. We built the only publicly available Ẹhugbo-English parallel corpus, 1,021 sentence pairs drawn from the New Testament of the Bible, and evaluated and fine-tuned two state-of-the-art models, M2M100 (facebook/m2m100_418M) and NLLB (facebook/nllb-200-distilled-600M). Initial results were stark: M2M100 achieved a BLEU score of 1.2188, while NLLB scored only 0.0262. After fine-tuning, M2M100 improved to 16.1719 and NLLB reached 20.4016, demonstrating the potential of adapting multilingual models to low-resource languages. Our findings reveal both promise and challenges: while fine-tuning significantly improves performance, the lack of diverse datasets limits translation quality and reinforces the need for inclusive data collection practices. This work highlights the importance of community-driven approaches, as linguistic preservation cannot be achieved without the active involvement of native speakers. This project not only advances the field of low-resource MT but also serves as a call to action for researchers and developers to prioritize linguistic diversity, ensuring that no language is left behind in the digital age. Keywords: multilingual low resource, resources for less-resourced languages, minoritized languages, less resourced languages, endangered languages, indigenous languages, corpus creation, multilingual corpora, evaluation, datasets for low resource languages, Igbo, Igbo language.}
}
Endnote
%0 Conference Paper
%T Ẹhugbo Ka! Advancing Machine Translation for the Low-Resource Ẹhugbo Language through Parallel Corpus Development
%A Ukachi Eze-Mbey
%A Uloma Eze-Mbey
%A Ololade Anjuwon
%B DLI 2025 Research Track
%C Proceedings of Machine Learning Research
%D 2026
%E Hatem Haddad
%E Albert Njoroge Kahira
%E Sofia Bourhim
%E Iyiola Emmanuel Olatunji
%E Lesego Makhafola
%E Christine Mwase
%F pmlr-v302-eze-mbey26a
%I PMLR
%P 1--9
%U https://proceedings.mlr.press/v302/eze-mbey26a.html
%V 302
%X Despite advances in language technologies, low-resource African languages and their dialects remain consistently excluded. One such dialect is Ẹhugbo, a critically endangered variant of Igbo spoken by fewer than 150,000 people in Afikpo, Nigeria. This exclusion perpetuates social and linguistic inequities, leaving speakers of such dialects without access to digital tools that could preserve their language and culture. This paper presents Ẹhugbo Ka! ("Greetings, Ẹhugbo!"), which addresses this gap. We built the only publicly available Ẹhugbo-English parallel corpus, 1,021 sentence pairs drawn from the New Testament of the Bible, and evaluated and fine-tuned two state-of-the-art models, M2M100 (facebook/m2m100_418M) and NLLB (facebook/nllb-200-distilled-600M). Initial results were stark: M2M100 achieved a BLEU score of 1.2188, while NLLB scored only 0.0262. After fine-tuning, M2M100 improved to 16.1719 and NLLB reached 20.4016, demonstrating the potential of adapting multilingual models to low-resource languages. Our findings reveal both promise and challenges: while fine-tuning significantly improves performance, the lack of diverse datasets limits translation quality and reinforces the need for inclusive data collection practices. This work highlights the importance of community-driven approaches, as linguistic preservation cannot be achieved without the active involvement of native speakers. This project not only advances the field of low-resource MT but also serves as a call to action for researchers and developers to prioritize linguistic diversity, ensuring that no language is left behind in the digital age. Keywords: multilingual low resource, resources for less-resourced languages, minoritized languages, less resourced languages, endangered languages, indigenous languages, corpus creation, multilingual corpora, evaluation, datasets for low resource languages, Igbo, Igbo language.
APA
Eze-Mbey, U., Eze-Mbey, U. & Anjuwon, O. (2026). Ẹhugbo Ka! Advancing Machine Translation for the Low-Resource Ẹhugbo Language through Parallel Corpus Development. DLI 2025 Research Track, in Proceedings of Machine Learning Research 302:1-9. Available from https://proceedings.mlr.press/v302/eze-mbey26a.html.