
- title: 'Algorithmic Fairness Through the Lens of Metrics and Evaluation (AFME) 2024'
  volume: 279
  URL: https://proceedings.mlr.press/v279/rateike25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/rateike25a/rateike25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-rateike25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 1-7
  id: rateike25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 1
  lastpage: 7
  published: 2025-04-23 00:00:00 +0000
- title: 'From Models to Systems: A Comprehensive Framework for AI System Fairness in Compositional Recommender Systems'
  abstract: 'Fairness research in machine learning often centers on ensuring equitable performance of individual models. However, real-world recommendation systems are built on multiple models and even multiple stages, from candidate retrieval to scoring and serving, which raises challenges for responsible development and deployment. This AI system-level view, as highlighted by regulations like the EU AI Act, necessitates moving beyond auditing individual models as independent entities. We propose a holistic framework for modeling AI system-level fairness, focusing on the end-utility delivered to diverse user groups, and consider interactions between components such as retrieval and scoring models. We provide formal insights on the limitations of focusing solely on model-level fairness and highlight the need for alternative tools that account for heterogeneity in user preferences. To mitigate system-level disparities, we adapt closed-box optimization tools (e.g., BayesOpt) to jointly optimize utility and equity. We empirically demonstrate the effectiveness of our proposed framework on synthetic and real datasets, underscoring the need for a framework that reflects the design of modern, industrial AI systems.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/hsu25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/hsu25a/hsu25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-hsu25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Brian
    family: Hsu
  - given: Cyrus
    family: DiCiccio
  - given: Natesh S.
    family: Pillai
  - given: Hongseok
    family: Namkoong
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 8-37
  id: hsu25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 8
  lastpage: 37
  published: 2025-04-23 00:00:00 +0000
- title: 'Better Bias Benchmarking of Language Models via Multi-factor Analysis'
  abstract: 'Bias benchmarks are important ways to assess fairness and bias of language models (LMs), but the design methodology and metrics used in these benchmarks are typically ad hoc. We propose an approach for multi-factor analysis of LM bias benchmarks inspired by methods from health informatics and experimental design. Given a benchmark, we first identify experimental factors of three types: domain factors that characterize the subject of the LM prompt, prompt factors that characterize how the prompt is formulated, and model factors that characterize the model and parameters used. We use coverage analysis to understand which biases the benchmark data examines with respect to these factors. We then use multi-factor analyses and metrics to understand the strengths and weaknesses of the LM on the benchmark. Prior benchmark analyses reached conclusions by comparing one to three factors at a time, typically using tables and heatmaps without principled metrics and tests that consider the effects of many factors. We propose examining how the interactions between factors contribute to bias and develop bias metrics across all subgroups using subgroup analysis approaches inspired by clinical trial and machine learning fairness research. We illustrate these proposed methods by demonstrating how they yield additional insights on the benchmark SocialStigmaQA. We discuss opportunities to create more effective, efficient, and reusable benchmarks with deeper insights by adopting more systematic multi-factor experimental design, analysis, and metrics.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/powers25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/powers25a/powers25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-powers25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Hannah
    family: Powers
  - given: Ioana
    family: Baldini
  - given: Dennis
    family: Wei
  - given: Kristin P.
    family: Bennett
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 38-67
  id: powers25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 38
  lastpage: 67
  published: 2025-04-23 00:00:00 +0000
- title: 'The Intersectionality Problem for Algorithmic Fairness'
  abstract: 'A yet unmet challenge in algorithmic fairness is the problem of intersectionality, that is, achieving fairness across the intersection of multiple groups—and verifying that such fairness has been attained. Because intersectional groups tend to be small, verifying whether a model is fair raises statistical as well as moral-methodological challenges. This paper (1) elucidates the problem of intersectionality in algorithmic fairness, (2) develops desiderata to clarify the challenges underlying the problem and guide the search for potential solutions, (3) illustrates the desiderata and potential solutions by sketching a proposal using simple hypothesis testing, and (4) evaluates, partly empirically, this proposal against the proposed desiderata.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/himmelreich25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/himmelreich25a/himmelreich25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-himmelreich25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Johannes
    family: Himmelreich
  - given: Arbie
    family: Hsu
  - given: Ellen
    family: Veomett
  - given: Kristian
    family: Lum
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 68-95
  id: himmelreich25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 68
  lastpage: 95
  published: 2025-04-23 00:00:00 +0000
- title: 'Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML'
  abstract: 'With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters—rather than the mitigation technique itself—can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/ganesh25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/ganesh25a/ganesh25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-ganesh25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Prakhar
    family: Ganesh
  - given: Usman
    family: Gohar
  - given: Lu
    family: Cheng
  - given: Golnoosh
    family: Farnadi
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 96-118
  id: ganesh25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 96
  lastpage: 118
  published: 2025-04-23 00:00:00 +0000
- title: 'Improving Bias Metrics in Vision-Language Models by Addressing Inherent Model Disabilities'
  abstract: 'The integration of Vision-Language Models (VLMs) into various applications has highlighted the importance of evaluating these models for inherent biases, especially along gender and racial lines. Traditional bias assessment methods in VLMs typically rely on accuracy metrics, assessing disparities in performance across different demographic groups. These methods, however, often overlook the impact of the model’s disabilities, like a lack of spatial reasoning, which may skew the bias assessment. In this work, we propose an approach that systematically examines how current bias evaluation metrics account for the model’s limitations. We introduce two methods that circumvent these disabilities by integrating spatial guidance from textual and visual modalities. Our experiments aim to refine bias quantification by effectively mitigating the impact of spatial reasoning limitations, offering a more accurate assessment of biases in VLMs.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/darur25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/darur25a/darur25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-darur25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Lakshmipathi Balaji
    family: Darur
  - given: Shanmukha Sai Keerthi
    family: Gouravarapu
  - given: Shashwat
    family: Goel
  - given: Ponnurangam
    family: Kumaraguru
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 119-132
  id: darur25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 119
  lastpage: 132
  published: 2025-04-23 00:00:00 +0000
- title: 'Multilingual Hallucination Gaps'
  abstract: 'Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations—misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in free-form text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FActScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high- and low-resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual free-form text generation.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/chataigner25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/chataigner25a/chataigner25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-chataigner25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Cléa
    family: Chataigner
  - given: Afaf
    family: Taïk
  - given: Golnoosh
    family: Farnadi
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 133-155
  id: chataigner25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 133
  lastpage: 155
  published: 2025-04-23 00:00:00 +0000
- title: 'Fairness-Enhancing Data Augmentation Methods for Worst-Group Accuracy'
  abstract: 'Ensuring fair predictions across many distinct subpopulations in the training data can be prohibitive for large models. Recently, simple linear last layer retraining strategies, in combination with data augmentation methods such as upweighting and downsampling, have been shown to achieve state-of-the-art performance for worst-group accuracy, which quantifies accuracy for the least prevalent subpopulation. For linear last layer retraining and the abovementioned augmentations, we present a comparison of the optimal worst-group accuracy when modeling the distribution of the latent representations (input to the last layer) as Gaussian for each subpopulation. Observing that these augmentation techniques rely heavily on well-labeled subpopulations, we present a comparison of the optimal worst-group accuracy in the setting of label noise. We verify our results for both synthetic and large publicly available datasets.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/welfert25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/welfert25a/welfert25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-welfert25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Monica
    family: Welfert
  - given: Nathan
    family: Stromberg
  - given: Lalitha
    family: Sankar
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 156-172
  id: welfert25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 156
  lastpage: 172
  published: 2025-04-23 00:00:00 +0000
- title: 'Privacy-Preserving Group Fairness in Cross-Device Federated Learning'
  abstract: 'Group fairness ensures that the outcomes of machine learning (ML) based decision-making systems are not biased towards a certain group of people defined by a sensitive attribute such as gender or ethnicity. Achieving group fairness in Federated Learning (FL) is challenging because mitigating bias inherently requires using the sensitive attribute values of all clients, while FL is aimed precisely at protecting privacy by not giving access to the clients’ data. As we show in this paper, this conflict between fairness and privacy in FL can be resolved by combining FL with Secure Multiparty Computation (MPC) and Differential Privacy (DP). To this end, we propose a privacy-preserving approach to calculate group fairness notions in the cross-device FL setting. Then, we propose two bias mitigation pre-processing and post-processing techniques in cross-device FL under formal privacy guarantees, without requiring the clients to disclose their sensitive attribute values. Empirical evaluations on real world datasets demonstrate the effectiveness of our solution to train fair and accurate ML models in federated cross-device setups with privacy guarantees to the users.'
  volume: 279
  URL: https://proceedings.mlr.press/v279/pentyala25a.html
  PDF: https://raw.githubusercontent.com/mlresearch/v279/main/assets/pentyala25a/pentyala25a.pdf
  edit: https://github.com/mlresearch//v279/edit/gh-pages/_posts/2025-04-23-pentyala25a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Algorithmic Fairness Through the Lens of Metrics and Evaluation'
  publisher: 'PMLR'
  author: 
  - given: Sikha
    family: Pentyala
  - given: Nicola
    family: Neophytou
  - given: Anderson
    family: Nascimento
  - given: Martine
    family: De Cock
  - given: Golnoosh
    family: Farnadi
  editor: 
  - given: Miriam
    family: Rateike
  - given: Awa
    family: Dieng
  - given: Jamelle
    family: Watson-Daniels
  - given: Ferdinando
    family: Fioretto
  - given: Golnoosh
    family: Farnadi
  page: 173-198
  id: pentyala25a
  issued:
    date-parts: 
      - 2025
      - 4
      - 23
  firstpage: 173
  lastpage: 198
  published: 2025-04-23 00:00:00 +0000
