BanglaNLI: A Benchmark Dataset for Bangla Natural Language Inference

MD Ajmain Mahtab, Atif Ronan, Sheikh Ayatur Rahman, Saleh Mohammad Sajid, Sanjida Tasnim, Farig Sadeque
Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, PMLR 319:264-277, 2026.

Abstract

We present BanglaNLI, a high-quality Bangla Natural Language Inference dataset with expert annotations. The dataset was constructed from 4,200 image captions from the BanglaLekha-ImageCaption dataset written by native Bangla speakers, with three hypotheses (entailment, contradiction, neutral) generated per premise, yielding 12,600 carefully annotated sentence pairs. Annotation artefacts were minimised by instructing annotators to avoid simple heuristics. Inter-annotator agreement measured by Cohen’s Kappa reached $\geq 0.88$, confirming high-quality annotations for this under-resourced language with over 200 million native speakers.
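As a rough illustration of the two quantities quoted in the abstract, the sketch below checks the pair count (4,200 premises with three hypotheses each) and computes Cohen's Kappa for two annotators. The function name `cohen_kappa` and the toy label lists are hypothetical and used only for illustration; they are not taken from the paper or its data, and the authors' actual agreement procedure may differ.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Pair count stated in the abstract: 3 hypotheses per premise.
num_premises = 4200
num_pairs = num_premises * 3

# Toy labels (made up for illustration, not from BanglaNLI).
annotator_1 = ["entailment", "contradiction", "neutral",
               "entailment", "neutral", "contradiction"]
annotator_2 = ["entailment", "contradiction", "neutral",
               "entailment", "contradiction", "contradiction"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(num_pairs, round(kappa, 2))
```

On the toy labels the annotators agree on five of six items, so the Kappa is well below the $\geq 0.88$ reported for the real dataset; the example only shows how the statistic corrects raw agreement for chance.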

Cite this Paper


BibTeX
@InProceedings{pmlr-v319-mahtab26a,
  title = {{BanglaNLI}: A Benchmark Dataset for {Bangla} Natural Language Inference},
  author = {Mahtab, MD Ajmain and Ronan, Atif and Rahman, Sheikh Ayatur and Sajid, Saleh Mohammad and Tasnim, Sanjida and Sadeque, Farig},
  booktitle = {Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments},
  pages = {264--277},
  year = {2026},
  editor = {Folorunso, Sakinat and Ogundokun, Roseline and Oladipo, Francisca},
  volume = {319},
  series = {Proceedings of Machine Learning Research},
  month = {11--14 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v319/main/assets/mahtab26a/mahtab26a.pdf},
  url = {https://proceedings.mlr.press/v319/mahtab26a.html},
  abstract = {We present BanglaNLI, a high-quality Bangla Natural Language Inference dataset with expert annotations. The dataset was constructed from 4,200 image captions from the BanglaLekha-ImageCaption dataset written by native Bangla speakers, with three hypotheses (entailment, contradiction, neutral) generated per premise, yielding 12,600 carefully annotated sentence pairs. Annotation artefacts were minimised by instructing annotators to avoid simple heuristics. Inter-annotator agreement measured by Cohen’s Kappa reached $\geq 0.88$, confirming high-quality annotations for this under-resourced language with over 200 million native speakers.}
}
Endnote
%0 Conference Paper
%T BanglaNLI: A Benchmark Dataset for Bangla Natural Language Inference
%A MD Ajmain Mahtab
%A Atif Ronan
%A Sheikh Ayatur Rahman
%A Saleh Mohammad Sajid
%A Sanjida Tasnim
%A Farig Sadeque
%B Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments
%C Proceedings of Machine Learning Research
%D 2026
%E Sakinat Folorunso
%E Roseline Ogundokun
%E Francisca Oladipo
%F pmlr-v319-mahtab26a
%I PMLR
%P 264--277
%U https://proceedings.mlr.press/v319/mahtab26a.html
%V 319
%X We present BanglaNLI, a high-quality Bangla Natural Language Inference dataset with expert annotations. The dataset was constructed from 4,200 image captions from the BanglaLekha-ImageCaption dataset written by native Bangla speakers, with three hypotheses (entailment, contradiction, neutral) generated per premise, yielding 12,600 carefully annotated sentence pairs. Annotation artefacts were minimised by instructing annotators to avoid simple heuristics. Inter-annotator agreement measured by Cohen’s Kappa reached $\geq 0.88$, confirming high-quality annotations for this under-resourced language with over 200 million native speakers.
APA
Mahtab, M.A., Ronan, A., Rahman, S.A., Sajid, S.M., Tasnim, S. & Sadeque, F. (2026). BanglaNLI: A Benchmark Dataset for Bangla Natural Language Inference. Proceedings of IndabaX Nigeria 2026: Building Scalable AI That Works: From Research to Deployment in Resource-Constrained Environments, in Proceedings of Machine Learning Research 319:264-277. Available from https://proceedings.mlr.press/v319/mahtab26a.html.
