ORIN-Lyrics: A Multilingual Nigerian Song Lyrics Dataset and Baseline for Efficient Language Detection
Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, PMLR 319:131-143, 2026.
Abstract
This study presents ORIN-Lyrics, a multilingual dataset of Nigerian song lyrics that follows FAIR data principles and supports multilingual NLP and cultural AI research. The dataset includes 853 songs representing 22 musical genres and 18 language categories, featuring Yoruba, English, Nigerian Pidgin, and code-switched compositions. The corpus contains 124,801 tokens and 12,098 unique words. An embedding-based visualisation reveals distinct semantic groupings across language categories. A baseline genre classification experiment using TF-IDF features and multi-class logistic regression achieves an accuracy of 0.54 (54%), substantially exceeding the random baseline of approximately 0.045 (4.5%). The dataset is intended to enable the creation of African language technologies that match local needs.
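
As a rough illustration of the kind of baseline the abstract describes, the sketch below pairs TF-IDF features with a multi-class logistic regression classifier using scikit-learn. It is not the authors' implementation: the toy texts, the labels `genre_a` and `genre_b`, and all hyperparameters are hypothetical placeholders, and the actual preprocessing and evaluation split are not stated in the abstract. The final lines show where a chance level near 4.5% would come from if guessing were uniform over the 22 reported genres.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: invented placeholder lines, NOT taken from ORIN-Lyrics.
texts = [
    "dance tonight under city lights",
    "we dance and sing all night",
    "give thanks for this new day",
    "we give thanks and praise today",
]
# Hypothetical genre labels, used only for illustration.
genres = ["genre_a", "genre_a", "genre_b", "genre_b"]

# TF-IDF features feeding a logistic regression classifier
# (default settings; the paper's settings are not given in the abstract).
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, genres)

# Predict the genre of an unseen, equally hypothetical line.
predictions = model.predict(["sing and dance tonight"])

# Chance accuracy under uniform random guessing over the 22 reported genres.
num_genres = 22
chance_accuracy = 1 / num_genres

print(predictions[0], round(chance_accuracy, 3))
```

On the real corpus, the same pipeline would be fitted on a training split and scored on held-out songs, so that the reported accuracy can be compared against the chance level.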