ORIN-Lyrics: A Multilingual Nigerian Song Lyrics Dataset and Baseline for Efficient Language Detection

Sakinat O. Folorunso; Oluwagbenga Odunsi; Ayodele David; Daniel Olaleye; Fatimah Salami; Oluwakemi Giwa

Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, PMLR 319:131-143, 2026.

Abstract

This study presents ORIN-Lyrics, a multilingual dataset of Nigerian song lyrics following FAIR data principles for multilingual NLP and cultural AI research. The dataset includes 853 songs representing 22 musical genres and 18 language categories, featuring Yoruba, English, Nigerian Pidgin, and code-switched compositions. The corpus contains 124,801 tokens and 12,098 unique words. An embedding-based visualisation method displays distinct semantic groupings between language categories. A baseline genre classification experiment using TF-IDF features and multi-class logistic regression achieves an accuracy of 0.54, substantially exceeding the random baseline of approximately 4.5%, enabling the creation of African language technologies that match local needs.
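The figures in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that the "random baseline" means guessing uniformly among the 22 genres, which is consistent with the quoted value of approximately 4.5%. The type-token ratio and mean tokens per song are derived here for illustration and are not reported in the abstract.

```python
# Figures quoted in the abstract.
NUM_SONGS = 853
NUM_GENRES = 22
NUM_TOKENS = 124_801
NUM_UNIQUE_WORDS = 12_098
REPORTED_ACCURACY = 0.54

# Assumption: the random baseline is uniform guessing over the 22 genres.
chance_accuracy = 1 / NUM_GENRES
lift_over_chance = REPORTED_ACCURACY / chance_accuracy

# Derived corpus statistics (illustrative, not stated in the abstract).
type_token_ratio = NUM_UNIQUE_WORDS / NUM_TOKENS
mean_tokens_per_song = NUM_TOKENS / NUM_SONGS

print(f"chance accuracy:      {chance_accuracy:.1%}")
print(f"lift over chance:     {lift_over_chance:.1f}x")
print(f"type-token ratio:     {type_token_ratio:.3f}")
print(f"mean tokens per song: {mean_tokens_per_song:.1f}")
```

Under this assumption, the reported accuracy of 0.54 is roughly twelve times the chance level. If the genres are imbalanced, a majority-class baseline would be higher than uniform guessing, so the comparison should be read with that in mind.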

Cite this Paper

BibTeX

@InProceedings{pmlr-v319-folorunso26b,
  title = 	 {{ORIN-Lyrics}: A Multilingual Nigerian Song Lyrics Dataset and Baseline for Efficient Language Detection},
  author =       {Folorunso, Sakinat O. and Odunsi, Oluwagbenga and David, Ayodele and Olaleye, Daniel and Salami, Fatimah and Giwa, Oluwakemi},
  booktitle = 	 {Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments},
  pages = 	 {131--143},
  year = 	 {2026},
  editor = 	 {Folorunso, Sakinat and Ogundokun, Roseline and Oladipo, Francisca},
  volume = 	 {319},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {11--14 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v319/main/assets/folorunso26b/folorunso26b.pdf},
  url = 	 {https://proceedings.mlr.press/v319/folorunso26b.html},
  abstract = 	 {This study presents ORIN-Lyrics, a multilingual dataset of Nigerian song lyrics following FAIR data principles for multilingual NLP and cultural AI research. The dataset includes 853 songs representing 22 musical genres and 18 language categories, featuring Yoruba, English, Nigerian Pidgin, and code-switched compositions. The corpus contains 124,801 tokens and 12,098 unique words. An embedding-based visualisation method displays distinct semantic groupings between language categories. A baseline genre classification experiment using TF-IDF features and multi-class logistic regression achieves an accuracy of 0.54, substantially exceeding the random baseline of approximately 4.5\%, enabling the creation of African language technologies that match local needs.}
}

Endnote

%0 Conference Paper
%T ORIN-Lyrics: A Multilingual Nigerian Song Lyrics Dataset and Baseline for Efficient Language Detection
%A Sakinat O. Folorunso
%A Oluwagbenga Odunsi
%A Ayodele David
%A Daniel Olaleye
%A Fatimah Salami
%A Oluwakemi Giwa
%B Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments
%C Proceedings of Machine Learning Research
%D 2026
%E Sakinat Folorunso
%E Roseline Ogundokun
%E Francisca Oladipo
%F pmlr-v319-folorunso26b
%I PMLR
%P 131--143
%U https://proceedings.mlr.press/v319/folorunso26b.html
%V 319
%X This study presents ORIN-Lyrics, a multilingual dataset of Nigerian song lyrics following FAIR data principles for multilingual NLP and cultural AI research. The dataset includes 853 songs representing 22 musical genres and 18 language categories, featuring Yoruba, English, Nigerian Pidgin, and code-switched compositions. The corpus contains 124,801 tokens and 12,098 unique words. An embedding-based visualisation method displays distinct semantic groupings between language categories. A baseline genre classification experiment using TF-IDF features and multi-class logistic regression achieves an accuracy of 0.54, substantially exceeding the random baseline of approximately 4.5%, enabling the creation of African language technologies that match local needs.

APA

Folorunso, S. O., Odunsi, O., David, A., Olaleye, D., Salami, F., & Giwa, O. (2026). ORIN-Lyrics: A Multilingual Nigerian Song Lyrics Dataset and Baseline for Efficient Language Detection. Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, in Proceedings of Machine Learning Research 319:131-143. Available from https://proceedings.mlr.press/v319/folorunso26b.html.
