Better Bias Benchmarking of Language Models via Multi-factor Analysis
Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation, PMLR 279:38-67, 2025.
Abstract
Bias benchmarks are important ways to assess fairness and bias of language models (LMs), but the design methodology and metrics used in these benchmarks are typically ad hoc. We propose an approach for multi-factor analysis of LM bias benchmarks inspired by methods from health informatics and experimental design. Given a benchmark, we first identify experimental factors of three types: domain factors that characterize the subject of the LM prompt, prompt factors that characterize how the prompt is formulated, and model factors that characterize the model and parameters used. We use coverage analysis to understand which biases the benchmark data examines with respect to these factors. We then use multi-factor analyses and metrics to understand the strengths and weaknesses of the LM on the benchmark. Prior benchmark analyses reached conclusions by comparing one to three factors at a time, typically using tables and heatmaps without principled metrics and tests that consider the effects of many factors. We propose examining how the interactions between factors contribute to bias and develop bias metrics across all subgroups using subgroup analysis approaches inspired by clinical trial and machine learning fairness research. We illustrate these proposed methods by demonstrating how they yield additional insights on the benchmark SocialStigmaQA. We discuss opportunities to create more effective, efficient, and reusable benchmarks with deeper insights by adopting more systematic multi-factor experimental design, analysis, and metrics.
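To make the coverage and subgroup analyses described above concrete, the following minimal sketch shows the kind of computation involved: counting benchmark items per combination of domain, prompt, and model factors (coverage), and computing a per-subgroup bias rate with its gap from the overall rate (subgroup analysis). The column names (`stigma_condition`, `prompt_style`, `model_name`, `biased_output`) and the toy data are hypothetical illustrations, not the paper's actual schema or SocialStigmaQA fields.

```python
# Illustrative sketch only; factor names and data are hypothetical.
import pandas as pd

# Toy benchmark results: one row per evaluated (prompt, model) pair.
results = pd.DataFrame({
    "stigma_condition": ["A", "A", "B", "B", "A", "B", "A", "B"],   # domain factor
    "prompt_style":     ["direct", "indirect"] * 4,                 # prompt factor
    "model_name":       ["m1"] * 4 + ["m2"] * 4,                    # model factor
    "biased_output":    [1, 0, 1, 1, 0, 0, 1, 0],                   # 1 = biased response
})

factors = ["stigma_condition", "prompt_style", "model_name"]

# Coverage analysis: how many benchmark items fall in each factor combination?
coverage = results.groupby(factors).size().rename("n_items")
print(coverage)

# Subgroup analysis: bias rate per factor combination, plus its gap from the
# overall rate, so under- and over-covered or high-bias subgroups stand out.
overall_rate = results["biased_output"].mean()
subgroup = results.groupby(factors)["biased_output"].mean().rename("bias_rate")
gaps = (subgroup - overall_rate).rename("gap_vs_overall")
print(pd.concat([subgroup, gaps], axis=1).sort_values("gap_vs_overall"))
```

In a full analysis of the kind the paper proposes, such per-subgroup rates would feed principled metrics and statistical tests over factor interactions rather than being read off tables or heatmaps one factor at a time.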