Detecting Phishing Emails in Nigerian Pidgin English Using a Dialect-Aware and Behavioural NLP Model

Zubaida Muhtar Alhassan
Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, PMLR 319:144-154, 2026.

Abstract

This study proposes a dialect-aware and behaviourally informed NLP model for detecting phishing emails in Nigerian Pidgin, spoken by over 100 million people in Nigeria. A balanced dataset of 870 emails was created using a hybrid translation and generation process, validated by native speakers. The model combines TF-IDF-based linguistic features with seven behavioural indicators derived from persuasion theory, optimised via a Genetic Algorithm-tuned Random Forest classifier. The system achieved 93.89% accuracy, 100.00% precision, and 87.69% recall, demonstrating the importance of integrating behavioural and linguistic analysis for cybersecurity in low-resource language contexts.
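The abstract reports precision and recall but no F1 score. As a rough aid for comparing against other work, the short sketch below derives the F1 value implied by those two figures using the standard harmonic-mean formula. The function name `f1_score` and the variable names are illustrative assumptions, and the resulting number is computed here from the rounded percentages in the abstract, not quoted from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both expressed in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract, converted from percentages to fractions.
reported_precision = 1.0000  # 100.00% precision
reported_recall = 0.8769     # 87.69% recall

# Derived value (an assumption-free calculation, but not a figure from the paper).
f1 = f1_score(reported_precision, reported_recall)
print(f"Implied F1: {f1:.4f}")
```

With perfect precision, F1 is limited only by recall, so the implied score sits between the reported recall and 100%.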

Cite this Paper


BibTeX
@InProceedings{pmlr-v319-alhassan26a,
  title = {Detecting Phishing Emails in {Nigerian} {Pidgin} {English} Using a Dialect-Aware and Behavioural {NLP} Model},
  author = {Alhassan, Zubaida Muhtar},
  booktitle = {Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments},
  pages = {144--154},
  year = {2026},
  editor = {Folorunso, Sakinat and Ogundokun, Roseline and Oladipo, Francisca},
  volume = {319},
  series = {Proceedings of Machine Learning Research},
  month = {11--14 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v319/main/assets/alhassan26a/alhassan26a.pdf},
  url = {https://proceedings.mlr.press/v319/alhassan26a.html},
  abstract = {This study proposes a dialect-aware and behaviourally informed NLP model for detecting phishing emails in Nigerian Pidgin, spoken by over 100 million people in Nigeria. A balanced dataset of 870 emails was created using a hybrid translation and generation process, validated by native speakers. The model combines TF-IDF-based linguistic features with seven behavioural indicators derived from persuasion theory, optimised via a Genetic Algorithm-tuned Random Forest classifier. The system achieved 93.89% accuracy, 100.00% precision, and 87.69% recall, demonstrating the importance of integrating behavioural and linguistic analysis for cybersecurity in low-resource language contexts.}
}
Endnote
%0 Conference Paper
%T Detecting Phishing Emails in Nigerian Pidgin English Using a Dialect-Aware and Behavioural NLP Model
%A Zubaida Muhtar Alhassan
%B Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments
%C Proceedings of Machine Learning Research
%D 2026
%E Sakinat Folorunso
%E Roseline Ogundokun
%E Francisca Oladipo
%F pmlr-v319-alhassan26a
%I PMLR
%P 144--154
%U https://proceedings.mlr.press/v319/alhassan26a.html
%V 319
%X This study proposes a dialect-aware and behaviourally informed NLP model for detecting phishing emails in Nigerian Pidgin, spoken by over 100 million people in Nigeria. A balanced dataset of 870 emails was created using a hybrid translation and generation process, validated by native speakers. The model combines TF-IDF-based linguistic features with seven behavioural indicators derived from persuasion theory, optimised via a Genetic Algorithm-tuned Random Forest classifier. The system achieved 93.89% accuracy, 100.00% precision, and 87.69% recall, demonstrating the importance of integrating behavioural and linguistic analysis for cybersecurity in low-resource language contexts.
APA
Alhassan, Z. M. (2026). Detecting Phishing Emails in Nigerian Pidgin English Using a Dialect-Aware and Behavioural NLP Model. Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, in Proceedings of Machine Learning Research 319:144-154. Available from https://proceedings.mlr.press/v319/alhassan26a.html.
