BanglaNLI: A Benchmark Dataset for Bangla Natural Language Inference
Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, PMLR 319:264-277, 2026.
Abstract
We present BanglaNLI, a high-quality Bangla Natural Language Inference dataset with expert annotations. The dataset was constructed from 4,200 image captions written by native Bangla speakers and drawn from the BanglaLekha-ImageCaption dataset. Each caption serves as a premise, and three hypotheses (one each for entailment, contradiction, and neutral) were generated per premise, yielding 12,600 annotated sentence pairs. To reduce annotation artefacts, annotators were instructed to avoid simple heuristics. Inter-annotator agreement, measured by Cohen’s kappa, was $\geq 0.88$, indicating consistent annotations for this under-resourced language, which has over 200 million native speakers.
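For readers unfamiliar with the agreement statistic, Cohen's kappa for two annotators is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed fraction of items on which the annotators agree and $p_e$ is the agreement expected by chance from each annotator's label frequencies. The following is a minimal sketch of that formula, not the authors' evaluation code: the function name `cohens_kappa` and the six toy labels are illustrative assumptions and are not drawn from BanglaNLI.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's per-label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only (hypothetical, not dataset annotations).
annotator_a = ["entailment", "contradiction", "neutral",
               "entailment", "neutral", "contradiction"]
annotator_b = ["entailment", "contradiction", "neutral",
               "neutral", "neutral", "contradiction"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

In this toy case the annotators agree on five of six items ($p_o = 5/6$) and chance agreement is $p_e = 1/3$, so $\kappa = 0.75$. A value of 0.88 or higher, as reported above, corresponds to considerably stronger agreement.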