A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:796-811, 2026.
Abstract
We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all 20 frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.
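As a rough consistency check on the heterogeneity figures quoted in the abstract, the sketch below recomputes I^2 from the reported Cochran's Q using the standard Higgins-Thompson relation I^2 = max(0, (Q - df) / Q). The degrees of freedom are an assumption here: the abstract does not state df, so the sketch takes df = k - 1 = 19 for the k = 20 frameworks. The function name `i_squared` is illustrative and not taken from the paper.

```python
def i_squared(q: float, df: int) -> float:
    """Return Higgins-Thompson I^2 as a percentage, floored at zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Values reported in the abstract.
Q_REPORTED = 10055.63
K_FRAMEWORKS = 20

# Assumption: df = k - 1; the abstract does not give df explicitly.
df = K_FRAMEWORKS - 1

i2 = i_squared(Q_REPORTED, df)
print(f"I^2 = {i2:.2f}%")
```

Under that assumption the result rounds to the 99.81% quoted above, which suggests the reported Q and I^2 are mutually consistent for 20 frameworks.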