Benchmarking Multimodal Semantic Alignment Between Speech and Text Representations in the Igbo Language
Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, PMLR 319:346-354, 2026.
Abstract
We present the first systematic benchmark for cross-modal semantic alignment between speech and text in Igbo, a tonal Niger-Congo language with approximately 45 million speakers and virtually no prior representation in multimodal NLP. On 699 stratified utterance pairs from the WAXAL corpus, the zero-shot cross-modal cosine similarity is $-0.0009$, statistically indistinguishable from random. A lightweight linear projection (147,840 parameters) trained with symmetric InfoNCE achieves a Speech-to-Text Recall@1 of 0.0658 ($5.1\times$ over zero-shot), Recall@10 of 0.3362 ($3.0\times$), and MRR of 0.1557 ($2.7\times$). The alignment is statistically significant ($t = 15.95$, $p = 2.81 \times 10^{-54}$). All embeddings, evaluation code, and benchmark protocols are released.
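To make the quantities named in the abstract concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE loss and of Speech-to-Text Recall@K and MRR computed from paired embeddings. It is not the released evaluation code. The function names, the temperature of 0.07, the tie-handling rule, and the 384-dimensional embeddings in the usage example are all assumptions made for illustration; the abstract does not state the embedding dimensions or the training hyperparameters.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def project(speech, weight, bias):
    """Linear projection of speech embeddings (hypothetical helper, not from the paper)."""
    return speech @ weight + bias


def symmetric_info_nce(speech, text, temperature=0.07):
    """Mean of the speech-to-text and text-to-speech InfoNCE losses.

    Row i of `speech` and row i of `text` are assumed to be a matching pair.
    The temperature value is an assumption, not taken from the source.
    """
    logits = l2_normalize(speech) @ l2_normalize(text).T / temperature

    def cross_entropy_on_diagonal(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def retrieval_metrics(speech, text, ks=(1, 10)):
    """Speech-to-text Recall@K and MRR; each speech row queries all text rows."""
    sims = l2_normalize(speech) @ l2_normalize(text).T
    true_scores = np.diag(sims)[:, None]
    # Rank of the correct text: 1 + number of candidates scoring strictly higher.
    # Ties are resolved in favour of the correct item (an assumed convention).
    ranks = 1 + (sims > true_scores).sum(axis=1)
    metrics = {f"recall@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / ranks))
    return metrics


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, dim = 64, 384  # hypothetical sizes chosen for the demo
    text = rng.normal(size=(n, dim))
    speech = rng.normal(size=(n, dim))

    # An untrained dim-to-dim projection with bias. With dim = 384 this has
    # 384 * 384 + 384 = 147,840 parameters, which happens to equal the count
    # in the abstract; the source does not confirm this configuration.
    weight = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    bias = np.zeros(dim)
    print("parameters:", weight.size + bias.size)

    projected = project(speech, weight, bias)
    print("loss:", symmetric_info_nce(projected, text))
    print(retrieval_metrics(projected, text))
```

In a training loop the projection weights would be updated to minimise this loss with an automatic-differentiation framework; the sketch only evaluates the loss and metrics for fixed embeddings.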