Better Bias Benchmarking of Language Models via Multi-factor Analysis
Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation, PMLR 279:38-67, 2025.
Abstract
Bias benchmarks are important ways to assess fairness and bias of language models (LMs), but the design methodology and metrics used in these benchmarks are typically ad hoc. We propose an approach for multi-factor analysis of LM bias benchmarks inspired by methods from health informatics and experimental design. Given a benchmark, we first identify experimental factors of three types: domain factors that characterize the subject of the LM prompt, prompt factors that characterize how the prompt is formulated, and model factors that characterize the model and parameters used. We use coverage analysis to understand which biases the benchmark data examines with respect to these factors. We then use multi-factor analyses and metrics to understand the strengths and weaknesses of the LM on the benchmark. Prior benchmark analyses reached conclusions by comparing one to three factors at a time, typically using tables and heatmaps without principled metrics and tests that consider the effects of many factors. We propose examining how the interactions between factors contribute to bias and develop bias metrics across all subgroups using subgroup analysis approaches inspired by clinical trial and machine learning fairness research. We illustrate these proposed methods by demonstrating how they yield additional insights on the benchmark SocialStigmaQA. We discuss opportunities to create more effective, efficient, and reusable benchmarks with deeper insights by adopting more systematic multi-factor experimental design, analysis, and metrics.
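To make the coverage and subgroup analyses described above concrete, the following minimal sketch shows the kind of computation involved: counting benchmark items per combination of domain, prompt, and model factors (coverage), and computing a per-subgroup bias rate with its gap from the overall rate (subgroup analysis). The column names (`stigma_condition`, `prompt_style`, `model_name`, `biased_output`) and the toy data are hypothetical illustrations, not the paper's actual schema or SocialStigmaQA fields.

```python
# Illustrative sketch only; factor names and data are hypothetical.
import pandas as pd

# Toy benchmark results: one row per evaluated (prompt, model) pair.
results = pd.DataFrame({
    "stigma_condition": ["A", "A", "B", "B", "A", "B", "A", "B"],   # domain factor
    "prompt_style":     ["direct", "indirect"] * 4,                 # prompt factor
    "model_name":       ["m1"] * 4 + ["m2"] * 4,                    # model factor
    "biased_output":    [1, 0, 1, 1, 0, 0, 1, 0],                   # 1 = biased response
})

factors = ["stigma_condition", "prompt_style", "model_name"]

# Coverage analysis: how many benchmark items fall in each factor combination?
coverage = results.groupby(factors).size().rename("n_items")
print(coverage)

# Subgroup analysis: bias rate per factor combination, plus its gap from the
# overall rate, so under- and over-covered or high-bias subgroups stand out.
overall_rate = results["biased_output"].mean()
subgroup = results.groupby(factors)["biased_output"].mean().rename("bias_rate")
gaps = (subgroup - overall_rate).rename("gap_vs_overall")
print(pd.concat([subgroup, gaps], axis=1).sort_values("gap_vs_overall"))
```

In a full analysis of the kind the paper proposes, such per-subgroup rates would feed principled metrics and statistical tests over factor interactions rather than being read off tables or heatmaps one factor at a time.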