Benchmarking Multimodal Semantic Alignment Between Speech and Text Representations in the Igbo Language

Chidiebere Christopher

Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, PMLR 319:346-354, 2026.

Abstract

We present the first systematic benchmark for cross-modal semantic alignment between speech and text in Igbo, a tonal Niger-Congo language with approximately 45 million speakers and virtually no prior multimodal NLP representation. Using 699 stratified utterance pairs from the WAXAL corpus, zero-shot cross-modal cosine similarity is $-0.0009$, statistically indistinguishable from random. A lightweight linear projection (147,840 parameters) trained with symmetric InfoNCE achieves Speech-to-Text Recall@1 of 0.0658 ($5.1\times$ over zero-shot), Recall@10 of 0.3362 ($3.0\times$), and MRR of 0.1557 ($2.7\times$). Alignment is statistically significant ($t = 15.95$, $p = 2.81 \times 10^{-54}$). All embeddings, evaluation code, and benchmark protocols are released.
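The metrics named in the abstract (Recall@k, MRR) and the symmetric InfoNCE objective can be illustrated with a minimal NumPy sketch. This is not the released evaluation code: the function names, the temperature value of 0.07, and the tie-handling rule are assumptions made here for illustration, and the inputs are assumed to be row-aligned arrays in which `speech[i]` and `text[i]` form a true pair.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def retrieval_metrics(speech, text, ks=(1, 10)):
    """Speech-to-text Recall@k and MRR for row-aligned embedding pairs."""
    sim = l2_normalize(speech) @ l2_normalize(text).T  # (n, n) cosine matrix
    true_scores = np.diag(sim)[:, None]                # score of each true pair
    # Rank of the true text per speech query (1 = best). Only strictly
    # higher-scoring candidates push the rank down, so ties are resolved
    # in favour of the true item (an assumption of this sketch).
    ranks = 1 + (sim > true_scores).sum(axis=1)
    metrics = {f"recall@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["mrr"] = float((1.0 / ranks).mean())
    return metrics


def symmetric_info_nce(speech, text, temperature=0.07):
    """Mean of speech-to-text and text-to-speech cross-entropy over a batch."""
    logits = l2_normalize(speech) @ l2_normalize(text).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilise the softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(8, 16))
    text = speech + 0.1 * rng.normal(size=(8, 16))  # toy, nearly aligned pairs
    print(retrieval_metrics(speech, text))
    print(symmetric_info_nce(speech, text))
```

In a trained system the `speech` array would hold the projected speech embeddings; the toy data in the `__main__` block only exercises the functions and says nothing about the reported results.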

Cite this Paper

BibTeX

@InProceedings{pmlr-v319-christopher26a,
  title = 	 {Benchmarking Multimodal Semantic Alignment Between Speech and Text Representations in the {Igbo} Language},
  author =       {Christopher, Chidiebere},
  booktitle = 	 {Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments},
  pages = 	 {346--354},
  year = 	 {2026},
  editor = 	 {Folorunso, Sakinat and Ogundokun, Roseline and Oladipo, Francisca},
  volume = 	 {319},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {11--14 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v319/main/assets/christopher26a/christopher26a.pdf},
  url = 	 {https://proceedings.mlr.press/v319/christopher26a.html},
  abstract = 	 {We present the first systematic benchmark for cross-modal semantic alignment between speech and text in Igbo, a tonal Niger-Congo language with approximately 45 million speakers and virtually no prior multimodal NLP representation. Using 699 stratified utterance pairs from the WAXAL corpus, zero-shot cross-modal cosine similarity is $-0.0009$, statistically indistinguishable from random. A lightweight linear projection (147,840 parameters) trained with symmetric InfoNCE achieves Speech-to-Text Recall@1 of 0.0658 ($5.1\times$ over zero-shot), Recall@10 of 0.3362 ($3.0\times$), and MRR of 0.1557 ($2.7\times$). Alignment is statistically significant ($t = 15.95$, $p = 2.81 \times 10^{-54}$). All embeddings, evaluation code, and benchmark protocols are released.}
}

Endnote

%0 Conference Paper
%T Benchmarking Multimodal Semantic Alignment Between Speech and Text Representations in the Igbo Language
%A Chidiebere Christopher
%B Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments
%C Proceedings of Machine Learning Research
%D 2026
%E Sakinat Folorunso
%E Roseline Ogundokun
%E Francisca Oladipo
%F pmlr-v319-christopher26a
%I PMLR
%P 346--354
%U https://proceedings.mlr.press/v319/christopher26a.html
%V 319
%X We present the first systematic benchmark for cross-modal semantic alignment between speech and text in Igbo, a tonal Niger-Congo language with approximately 45 million speakers and virtually no prior multimodal NLP representation. Using 699 stratified utterance pairs from the WAXAL corpus, zero-shot cross-modal cosine similarity is $-0.0009$, statistically indistinguishable from random. A lightweight linear projection (147,840 parameters) trained with symmetric InfoNCE achieves Speech-to-Text Recall@1 of 0.0658 ($5.1\times$ over zero-shot), Recall@10 of 0.3362 ($3.0\times$), and MRR of 0.1557 ($2.7\times$). Alignment is statistically significant ($t = 15.95$, $p = 2.81 \times 10^{-54}$). All embeddings, evaluation code, and benchmark protocols are released.

APA

Christopher, C. (2026). Benchmarking Multimodal Semantic Alignment Between Speech and Text Representations in the Igbo Language. Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, in Proceedings of Machine Learning Research 319:346-354. Available from https://proceedings.mlr.press/v319/christopher26a.html.
