A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization

Asif Ahmed Neloy, Md Nazmul Islam
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:796-811, 2026.

Abstract

We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all twenty frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.
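As a quick check on the heterogeneity figures above, the reported I^2 can be recomputed from the reported Q with the standard Higgins formula I^2 = max(0, (Q - df) / Q). This is a minimal sketch, not the authors' code: the abstract does not state the degrees of freedom, so df = 19 (20 frameworks minus 1) is an assumption, and the function name `i_squared` is hypothetical.

```python
def i_squared(q: float, df: int) -> float:
    """Higgins' I^2 as a fraction in [0, 1]: max(0, (Q - df) / Q)."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)


# Q is taken from the abstract; df is an ASSUMPTION (20 frameworks minus 1).
Q_REPORTED = 10055.63
DF_ASSUMED = 20 - 1

if __name__ == "__main__":
    # Prints the heterogeneity share as a percentage with two decimals.
    print(f"I^2 = {i_squared(Q_REPORTED, DF_ASSUMED):.2%}")  # I^2 = 99.81%
```

Under that assumption the result agrees with the 99.81% reported in the abstract, which suggests the statistic was computed across the 20 frameworks.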

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-neloy26a,
  title = {A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization},
  author = {Neloy, Asif Ahmed and Islam, Md Nazmul},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {796--811},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/neloy26a/neloy26a.pdf},
  url = {https://proceedings.mlr.press/v318/neloy26a.html},
  abstract = {We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all twenty frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.}
}
Endnote
%0 Conference Paper
%T A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization
%A Asif Ahmed Neloy
%A Md Nazmul Islam
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-neloy26a
%I PMLR
%P 796--811
%U https://proceedings.mlr.press/v318/neloy26a.html
%V 318
%X We conduct the first systematic meta-analysis comparing 20 Retrieval-Augmented Generation (RAG) evaluation frameworks, spanning traditional metrics and interpretability methods, from 2020 through 2026, using identical samples across three knowledge domains. Applying all twenty frameworks to 200 question-context-answer triples from RAGBench, we obtain Cochran’s Q = 10,055.63 (p < 0.001) with I^2 = 99.81%, indicating that the large majority of score variance reflects true differences between frameworks rather than sampling noise. Pairwise Pearson correlations range from r = -0.28 to r = 0.90 (median r = 0.21), and three distinct clusters emerge: LLM-as-judge methods (within-cluster mean r = 0.55), a mixed-methods group (mean r = 0.63), and an outlier cluster containing BERTScore, GaRAGe, HALT-RAG, QAFactEval, and RAGChecker (mean r = 0.01). Cluster assignments are perfectly stable across four normalization schemes (ARI = 1.0), and bootstrap resampling confirms co-assignment probabilities of at least 0.75 within the LLM-as-judge and mixed-methods clusters, with the outlier cluster ranging from 0.46 to 0.97. A cross-cluster consensus protocol labels 92.5% of samples as contested, with only 6% receiving unanimous faithful verdicts. These results demonstrate that current evaluation frameworks do not measure a unified construct, and we provide empirically grounded selection guidelines for future research.
APA
Neloy, A.A. & Islam, M.N. (2026). A Meta-Analysis of Evaluation Framework Reliability and Cross-Domain Generalization. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:796-811. Available from https://proceedings.mlr.press/v318/neloy26a.html.
