Multilingual Hallucination Gaps

Cléa Chataigner, Afaf Taïk, Golnoosh Farnadi
Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation, PMLR 279:133-155, 2025.

Abstract

Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations—misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in free-form text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FActScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high- and low-resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual free-form text generation.
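
For context, FActScore evaluates a generation by decomposing it into atomic facts and scoring the fraction supported by a reference knowledge source (here, Wikipedia pages). The Python sketch below is a minimal, illustrative rendering of that core computation only; the factscore helper and the is_supported judge are hypothetical stand-ins, not the paper's actual pipeline, which uses an LLM to decompose generations and verify each fact against retrieved Wikipedia passages.

    # Minimal sketch of a FActScore-style computation (hypothetical helpers;
    # the real FActScore pipeline uses an LLM to extract atomic facts and to
    # verify them against retrieved Wikipedia passages).
    from typing import Callable, List

    def factscore(
        atomic_facts: List[str],
        is_supported: Callable[[str], bool],  # judge: is this fact backed by the source?
    ) -> float:
        """Fraction of atomic facts supported by the knowledge source."""
        if not atomic_facts:
            return 0.0
        return sum(is_supported(fact) for fact in atomic_facts) / len(atomic_facts)

    # Toy example: a generated biography broken into atomic facts, checked
    # against a (toy) set of facts taken from the subject's Wikipedia page.
    wiki_facts = {
        "Marie Curie was born in Warsaw.",
        "Marie Curie won two Nobel Prizes.",
    }
    generation_facts = [
        "Marie Curie was born in Warsaw.",
        "Marie Curie won two Nobel Prizes.",
        "Marie Curie was born in 1900.",  # hallucinated fact
    ]
    score = factscore(generation_facts, lambda fact: fact in wiki_facts)
    print(f"FActScore: {score:.2f}")  # 0.67 -> one of three facts is unsupported

A multilingual hallucination gap, in these terms, would show up as systematically different scores for the same subjects when the prompt and reference pages are in different languages.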

Cite this Paper


BibTeX
@InProceedings{pmlr-v279-chataigner25a,
  title = {Multilingual Hallucination Gaps},
  author = {Chataigner, Cl{\'e}a and Ta{\"i}k, Afaf and Farnadi, Golnoosh},
  booktitle = {Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation},
  pages = {133--155},
  year = {2025},
  editor = {Rateike, Miriam and Dieng, Awa and Watson-Daniels, Jamelle and Fioretto, Ferdinando and Farnadi, Golnoosh},
  volume = {279},
  series = {Proceedings of Machine Learning Research},
  month = {14 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v279/main/assets/chataigner25a/chataigner25a.pdf},
  url = {https://proceedings.mlr.press/v279/chataigner25a.html},
  abstract = {Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations—misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in free-form text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FActScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high- and low-resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual free-form text generation.}
}
Endnote
%0 Conference Paper %T Multilingual Hallucination Gaps %A Cléa Chataigner %A Afaf Taïk %A Golnoosh Farnadi %B Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation %C Proceedings of Machine Learning Research %D 2025 %E Miriam Rateike %E Awa Dieng %E Jamelle Watson-Daniels %E Ferdinando Fioretto %E Golnoosh Farnadi %F pmlr-v279-chataigner25a %I PMLR %P 133--155 %U https://proceedings.mlr.press/v279/chataigner25a.html %V 279 %X Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations—misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in free-form text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FActScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high- and low-resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual free-form text generation.
APA
Chataigner, C., Taïk, A. & Farnadi, G. (2025). Multilingual Hallucination Gaps. Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation, in Proceedings of Machine Learning Research 279:133-155. Available from https://proceedings.mlr.press/v279/chataigner25a.html.
