Multilingual Hallucination Gaps
Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation, PMLR 279:133-155, 2025.
Abstract
Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations: misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in free-form text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FActScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high- and low-resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual free-form text generation.
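For readers unfamiliar with FActScore, the minimal Python sketch below illustrates the core of the metric as it would apply here: a generated biography is decomposed into atomic facts, each fact is checked against the reference knowledge source (in this study, the corresponding Wikipedia page), and the score is the fraction of supported facts, averaged over prompts. The helpers `extract_atomic_facts` and `is_supported` are hypothetical placeholders for the LLM-based decomposition and verification components of the FActScore pipeline; this is an illustrative assumption, not the authors' implementation.

```python
from typing import Callable, Iterable


def factscore_for_generation(
    generation: str,
    reference: str,
    extract_atomic_facts: Callable[[str], list[str]],
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of atomic facts in `generation` supported by `reference`.

    `extract_atomic_facts` and `is_supported` stand in for the LLM-based
    decomposition and verification steps of the FActScore framework.
    """
    facts = extract_atomic_facts(generation)
    if not facts:
        return 0.0  # convention when a generation contains no checkable facts
    supported = sum(1 for fact in facts if is_supported(fact, reference))
    return supported / len(facts)


def mean_factscore(
    pairs: Iterable[tuple[str, str]],
    extract_atomic_facts: Callable[[str], list[str]],
    is_supported: Callable[[str, str], bool],
) -> float:
    """Average FActScore over (generation, reference) pairs, e.g. one
    generated biography per person paired with its Wikipedia page."""
    scores = [
        factscore_for_generation(gen, ref, extract_atomic_facts, is_supported)
        for gen, ref in pairs
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Comparing this average across languages for the same set of subjects is what exposes the multilingual hallucination gaps the abstract describes: a lower score in one language than another indicates a higher rate of unsupported (hallucinated) facts for that language.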