The State of Large Language Models for African Languages: Progress and Challenges

Kedir Hussen, Walelign Sewunetie, Abinew Ayele, Sukairaj Imam, Eyob Alemu, Shamsuddeen Muhammad, Seid Yimam
DLI 2025 Research Track, PMLR 302:1-27, 2026.

Abstract

Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits remain largely out of reach for Africa’s roughly 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training data, technical limitations, script support, and language-modelling roadmaps. The survey identifies 41 supported African languages and 23 publicly available datasets, and it reveals a stark imbalance: four languages (Amharic, Swahili, Afrikaans, and Malagasy) are consistently covered, while over 98% of African languages remain unsupported. Moreover, the review finds that only the Latin, Arabic, and Ge’ez scripts are represented, while 20 other actively used scripts are neglected. The primary challenges include data scarcity, tokenization biases, high computational costs, and evaluation gaps. Addressing them requires language standardization, community-driven corpus development, and effective adaptation methods for African languages.

Keywords: Large Language Models (LLMs), Small Language Models (SLMs), Low-resource languages, Specialized SLMs (SSLMs)

Cite this Paper


BibTeX
@InProceedings{pmlr-v302-hussen26a,
  title     = {The State of Large Language Models for African Languages: Progress and Challenges},
  author    = {Hussen, Kedir and Sewunetie, Walelign and Ayele, Abinew and Imam, Sukairaj and Alemu, Eyob and Muhammad, Shamsuddeen and Yimam, Seid},
  booktitle = {DLI 2025 Research Track},
  pages     = {1--27},
  year      = {2026},
  editor    = {Haddad, Hatem and Kahira, Albert Njoroge and Bourhim, Sofia and Olatunji, Iyiola Emmanuel and Makhafola, Lesego and Mwase, Christine},
  volume    = {302},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--22 Aug},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v302/main/assets/hussen26a/hussen26a.pdf},
  url       = {https://proceedings.mlr.press/v302/hussen26a.html},
  abstract  = {Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits are largely absent for Africa’s 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps. The work identifies 41 supported African languages and 23 available public data sets, and it shows a big gap where four languages (Amharic, Swahili, Afrikaans, and Malagasy) are always treated while there is over 98% of unsupported African languages. Moreover, the review shows that just Latin, Arabic, and Ge’ez scripts are identified while 20 active scripts are neglected. Some of the primary challenges are lack of data, tokenization biases, very high computational costs, and evaluation issues. These issues demand language standardization, corpus development by the community, and effective adaptation methods for African languages. Keywords: Large Language Models (LLMs), Small Language Models (SLMs), Low resource languages, Specialized SLMs (SSLMs)}
}
Endnote
%0 Conference Paper
%T The State of Large Language Models for African Languages: Progress and Challenges
%A Kedir Hussen
%A Walelign Sewunetie
%A Abinew Ayele
%A Sukairaj Imam
%A Eyob Alemu
%A Shamsuddeen Muhammad
%A Seid Yimam
%B DLI 2025 Research Track
%C Proceedings of Machine Learning Research
%D 2026
%E Hatem Haddad
%E Albert Njoroge Kahira
%E Sofia Bourhim
%E Iyiola Emmanuel Olatunji
%E Lesego Makhafola
%E Christine Mwase
%F pmlr-v302-hussen26a
%I PMLR
%P 1--27
%U https://proceedings.mlr.press/v302/hussen26a.html
%V 302
%X Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits are largely absent for Africa’s 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps. The work identifies 41 supported African languages and 23 available public data sets, and it shows a big gap where four languages (Amharic, Swahili, Afrikaans, and Malagasy) are always treated while there is over 98% of unsupported African languages. Moreover, the review shows that just Latin, Arabic, and Ge’ez scripts are identified while 20 active scripts are neglected. Some of the primary challenges are lack of data, tokenization biases, very high computational costs, and evaluation issues. These issues demand language standardization, corpus development by the community, and effective adaptation methods for African languages. Keywords: Large Language Models (LLMs), Small Language Models (SLMs), Low resource languages, Specialized SLMs (SSLMs)
APA
Hussen, K., Sewunetie, W., Ayele, A., Imam, S., Alemu, E., Muhammad, S. & Yimam, S. (2026). The State of Large Language Models for African Languages: Progress and Challenges. DLI 2025 Research Track, in Proceedings of Machine Learning Research 302:1-27. Available from https://proceedings.mlr.press/v302/hussen26a.html.

Related Material