<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Proceedings of Machine Learning Research</title>
    <description>Proceedings of the Workshop on Algorithmic Fairness Through the Lens of Metrics and Evaluation
  Held in Vancouver Convention Center, Vancouver, Canada on 14 December 2024

Published as Volume 279 by the Proceedings of Machine Learning Research on 23 April 2025.

Volume Edited by:
  Miriam Rateike
  Awa Dieng
  Jamelle Watson-Daniels
  Ferdinando Fioretto
  Golnoosh Farnadi

Series Editors:
  Neil D. Lawrence
</description>
    <link>https://proceedings.mlr.press/v279/</link>
    <atom:link href="https://proceedings.mlr.press/v279/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 11 May 2025 17:34:18 +0000</pubDate>
    <lastBuildDate>Sun, 11 May 2025 17:34:18 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Fairness-Enhancing Data Augmentation Methods for Worst-Group Accuracy</title>
        <description>Ensuring fair predictions across many distinct subpopulations in the training data can be prohibitive for large models. Recently, simple linear last layer retraining strategies, in combination with data augmentation methods such as upweighting and downsampling, have been shown to achieve state-of-the-art performance for worst-group accuracy, which quantifies accuracy for the least prevalent subpopulation. For linear last layer retraining and the abovementioned augmentations, we present a comparison of the optimal worst-group accuracy when modeling the distribution of the latent representations (input to the last layer) as Gaussian for each subpopulation. Observing that these augmentation techniques rely heavily on well-labeled subpopulations, we present a comparison of the optimal worst-group accuracy in the setting of label noise. We verify our results for both synthetic and large publicly available datasets.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/welfert25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/welfert25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Algorithmic Fairness Through the Lens of Metrics and Evaluation (AFME) 2024</title>
        <description></description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/rateike25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/rateike25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Better Bias Benchmarking of Language Models via Multi-factor Analysis</title>
        <description>Bias benchmarks are important ways to assess fairness and bias of language models (LMs), but the design methodology and metrics used in these benchmarks are typically ad hoc. We propose an approach for multi-factor analysis of LM bias benchmarks inspired by methods from health informatics and experimental design. Given a benchmark, we first identify experimental factors of three types: domain factors that characterize the subject of the LM prompt, prompt factors that characterize how the prompt is formulated, and model factors that characterize the model and parameters used. We use coverage analysis to understand which biases the benchmark data examines with respect to these factors. We then use multi-factor analyses and metrics to understand the strengths and weaknesses of the LM on the benchmark. Prior benchmark analyses reached conclusions by comparing one to three factors at a time, typically using tables and heatmaps without principled metrics and tests that consider the effects of many factors. We propose examining how the interactions between factors contribute to bias and develop bias metrics across all subgroups using subgroup analysis approaches inspired by clinical trial and machine learning fairness research. We illustrate these proposed methods by demonstrating how they yield additional insights on the benchmark SocialStigmaQA. We discuss opportunities to create more effective, efficient, and reusable benchmarks with deeper insights by adopting more systematic multi-factor experimental design, analysis, and metrics.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/powers25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/powers25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Privacy-Preserving Group Fairness in Cross-Device Federated Learning</title>
        <description>Group fairness ensures that the outcomes of machine learning (ML) based decision-making systems are not biased towards a certain group of people defined by a sensitive attribute such as gender or ethnicity. Achieving group fairness in Federated Learning (FL) is challenging because mitigating bias inherently requires using the sensitive attribute values of all clients, while FL is aimed precisely at protecting privacy by not giving access to the clients’ data. As we show in this paper, this conflict between fairness and privacy in FL can be resolved by combining FL with Secure Multiparty Computation (MPC) and Differential Privacy (DP). To this end, we propose a privacy-preserving approach to calculate group fairness notions in the cross-device FL setting. Then, we propose two bias mitigation pre-processing and post-processing techniques in cross-device FL under formal privacy guarantees, without requiring the clients to disclose their sensitive attribute values. Empirical evaluations on real-world datasets demonstrate the effectiveness of our solution to train fair and accurate ML models in federated cross-device setups with privacy guarantees to the users.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/pentyala25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/pentyala25a.html</guid>
        
        
      </item>
    
      <item>
        <title>From Models to Systems: A Comprehensive Framework for AI System Fairness in Compositional Recommender Systems</title>
        <description>Fairness research in machine learning often centers on ensuring equitable performance of individual models. However, real-world recommendation systems are built on multiple models and even multiple stages, from candidate retrieval to scoring and serving, which raises challenges for responsible development and deployment. This AI system-level view, as highlighted by regulations like the EU AI Act, necessitates moving beyond auditing individual models as independent entities. We propose a holistic framework for modeling AI system-level fairness, focusing on the end-utility delivered to diverse user groups, and consider interactions between components such as retrieval and scoring models. We provide formal insights on the limitations of focusing solely on model-level fairness and highlight the need for alternative tools that account for heterogeneity in user preferences. To mitigate system-level disparities, we adapt closed-box optimization tools (e.g., BayesOpt) to jointly optimize utility and equity. We empirically demonstrate the effectiveness of our proposed framework on synthetic and real datasets, underscoring the need for a framework that reflects the design of modern, industrial AI systems.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/hsu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/hsu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>The Intersectionality Problem for Algorithmic Fairness</title>
        <description>A yet unmet challenge in algorithmic fairness is the problem of intersectionality, that is, achieving fairness across the intersection of multiple groups, and verifying that such fairness has been attained. Because intersectional groups tend to be small, verifying whether a model is fair raises statistical as well as moral-methodological challenges. This paper (1) elucidates the problem of intersectionality in algorithmic fairness, (2) develops desiderata to clarify the challenges underlying the problem and guide the search for potential solutions, (3) illustrates the desiderata and potential solutions by sketching a proposal using simple hypothesis testing, and (4) evaluates, partly empirically, this proposal against the proposed desiderata.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/himmelreich25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/himmelreich25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML</title>
        <description>With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/ganesh25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/ganesh25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Improving Bias Metrics in Vision-Language Models by Addressing Inherent Model Disabilities</title>
        <description>The integration of Vision-Language Models (VLMs) into various applications has highlighted the importance of evaluating these models for inherent biases, especially along gender and racial lines. Traditional bias assessment methods in VLMs typically rely on accuracy metrics, assessing disparities in performance across different demographic groups. These methods, however, often overlook the impact of the model’s disabilities, like a lack of spatial reasoning, which may skew the bias assessment. In this work, we propose an approach that systematically examines how current bias evaluation metrics account for the model’s limitations. We introduce two methods that circumvent these disabilities by integrating spatial guidance from textual and visual modalities. Our experiments aim to refine bias quantification by effectively mitigating the impact of spatial reasoning limitations, offering a more accurate assessment of biases in VLMs.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/darur25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/darur25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Multilingual Hallucination Gaps</title>
        <description>Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations: misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in free-form text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FActScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high- and low-resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual free-form text generation.</description>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v279/chataigner25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v279/chataigner25a.html</guid>
        
        
      </item>
    
  </channel>
</rss>
