A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization

Asif Ahmed Neloy; Md Nazmul Islam

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:796-811, 2026.

Abstract

We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all twenty frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.
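As a reading aid for the heterogeneity figures above, the following minimal sketch shows how I^2 is conventionally derived from Cochran's Q using the Higgins-Thompson formula I^2 = max(0, (Q - df) / Q). It assumes df = k - 1 with k = 20 frameworks; the abstract does not state the degrees of freedom, so that choice is an assumption, and the function name `i_squared` is hypothetical rather than taken from the paper.

```python
def i_squared(q: float, k: int) -> float:
    """Return I^2 as a percentage from Cochran's Q computed over k units.

    Assumes df = k - 1 (an assumption; the abstract does not state df).
    """
    df = k - 1
    if q <= 0:
        return 0.0
    # I^2 is the share of total variation beyond what chance alone would
    # produce; it is truncated at zero when Q falls below df.
    return max(0.0, (q - df) / q) * 100.0


# Q is taken from the abstract; k = 20 is the number of frameworks compared.
print(round(i_squared(10055.63, 20), 2))  # 99.81
```

Under that assumption the result agrees with the reported I^2 = 99.81%.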

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-neloy26a,
  title = 	 {A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization},
  author =       {Neloy, Asif Ahmed and Islam, Md Nazmul},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {796--811},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/neloy26a/neloy26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/neloy26a.html},
  abstract = 	 {We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all twenty frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.}
}

Endnote

%0 Conference Paper
%T A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization
%A Asif Ahmed Neloy
%A Md Nazmul Islam
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-neloy26a
%I PMLR
%P 796--811
%U https://proceedings.mlr.press/v318/neloy26a.html
%V 318
%X We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all twenty frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.

APA

Neloy, A.A. & Islam, M.N. (2026). A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:796-811. Available from https://proceedings.mlr.press/v318/neloy26a.html.
