Facial Demography Analysis of the LAION Dataset

Iris Dominguez-Catena, Daniel Paternain, Mikel Galar
Proceedings of Fourth European Workshop on Algorithmic Fairness, PMLR 294:357-361, 2025.

Abstract

Large-scale image-text datasets have become fundamental building blocks for modern AI systems, raising concerns about the demographic biases they may encode and propagate. We present a comprehensive analysis of LAION, one of the largest and most influential datasets in this domain, focusing on demographic representation and intersectional biases across age, gender, and race. Our methodology combines state-of-the-art face detection (RetinaFace) with specialized demographic classifiers (FairFace and EMO-AffectNet) to analyze a random sample of 500,000 image URLs from ReLAION-2B-en, yielding over 37,000 faces. We analyze both general representational biases, revealing severe overrepresentation of certain groups, such as White people and individuals aged 20-29, and intersectional biases, notably the underrepresentation of women over 30 years old and non-White infants. These results highlight the importance of considering not just individual demographic attributes, but their intersections, when evaluating and mitigating bias in large-scale datasets.
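
For illustration, a minimal sketch of the kind of pipeline the abstract describes (RetinaFace face detection followed by FairFace-style age/gender/race classification) might look as follows. This is not the authors' released code: the checkpoint filename is a placeholder, the label ordering follows the public FairFace repository and should be verified against it, and the EMO-AffectNet component is omitted.

    # Hypothetical sketch of the detection + demographic classification pipeline
    # described in the abstract (illustrative only, not the authors' code).
    import torch
    import torchvision.transforms as T
    from torchvision.models import resnet34
    from retinaface import RetinaFace   # pip package "retina-face"
    from PIL import Image

    # FairFace-style classifier: ResNet34 with 18 outputs
    # (7 race + 2 gender + 9 age-group logits).
    # "fairface_resnet34.pt" is a placeholder checkpoint path.
    model = resnet34()
    model.fc = torch.nn.Linear(model.fc.in_features, 18)
    model.load_state_dict(torch.load("fairface_resnet34.pt", map_location="cpu"))
    model.eval()

    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Label orders as used in the public FairFace repository (assumed here).
    RACES = ["White", "Black", "Latino_Hispanic", "East Asian",
             "Southeast Asian", "Indian", "Middle Eastern"]
    GENDERS = ["Male", "Female"]
    AGES = ["0-2", "3-9", "10-19", "20-29", "30-39",
            "40-49", "50-59", "60-69", "70+"]

    def analyze(image_path: str):
        """Detect faces with RetinaFace, then classify each cropped face."""
        img = Image.open(image_path).convert("RGB")
        detections = RetinaFace.detect_faces(image_path)
        results = []
        if not isinstance(detections, dict):   # no faces found
            return results
        for face in detections.values():
            x1, y1, x2, y2 = face["facial_area"]
            crop = preprocess(img.crop((x1, y1, x2, y2))).unsqueeze(0)
            with torch.no_grad():
                logits = model(crop).squeeze(0)
            results.append({
                "race": RACES[logits[:7].argmax().item()],
                "gender": GENDERS[logits[7:9].argmax().item()],
                "age": AGES[logits[9:18].argmax().item()],
            })
        return results

    # Example usage: aggregate predictions over a URL sample to estimate
    # demographic representation, e.g. analyze("downloaded_image.jpg").
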

Cite this Paper


BibTeX
@InProceedings{pmlr-v294-dominguez-catena25a,
  title     = {Facial Demography Analysis of the LAION Dataset},
  author    = {Dominguez-Catena, Iris and Paternain, Daniel and Galar, Mikel},
  booktitle = {Proceedings of Fourth European Workshop on Algorithmic Fairness},
  pages     = {357--361},
  year      = {2025},
  editor    = {Weerts, Hilde and Pechenizkiy, Mykola and Allhutter, Doris and Corrêa, Ana Maria and Grote, Thomas and Liem, Cynthia},
  volume    = {294},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--02 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v294/main/assets/dominguez-catena25a/dominguez-catena25a.pdf},
  url       = {https://proceedings.mlr.press/v294/dominguez-catena25a.html},
  abstract  = {Large-scale image-text datasets have become fundamental building blocks for modern AI systems, raising concerns about the demographic biases they may encode and propagate. We present a comprehensive analysis of LAION, one of the largest and most influential datasets in this domain, focusing on demographic representation and intersectional biases across age, gender and race. Our methodology combines state-of-the-art face detection (RetinaFace) with specialized demographic classifiers (FairFace and EMO-AffectNet) to analyze a random sample of 500,000 image URLs from ReLAION-2B-en, yielding over 37,000 faces. We analyze both general representational biases, revealing severe overrepresentation of certain groups, such as white people and individuals aged 20-29, and intersectional biases, notably the underrepresentation of women over 30 years old and non-White infants. These results highlight the importance of considering not just individual demographic attributes, but their intersections when evaluating and mitigating bias in large-scale datasets.}
}
Endnote
%0 Conference Paper
%T Facial Demography Analysis of the LAION Dataset
%A Iris Dominguez-Catena
%A Daniel Paternain
%A Mikel Galar
%B Proceedings of Fourth European Workshop on Algorithmic Fairness
%C Proceedings of Machine Learning Research
%D 2025
%E Hilde Weerts
%E Mykola Pechenizkiy
%E Doris Allhutter
%E Ana Maria Corrêa
%E Thomas Grote
%E Cynthia Liem
%F pmlr-v294-dominguez-catena25a
%I PMLR
%P 357--361
%U https://proceedings.mlr.press/v294/dominguez-catena25a.html
%V 294
%X Large-scale image-text datasets have become fundamental building blocks for modern AI systems, raising concerns about the demographic biases they may encode and propagate. We present a comprehensive analysis of LAION, one of the largest and most influential datasets in this domain, focusing on demographic representation and intersectional biases across age, gender and race. Our methodology combines state-of-the-art face detection (RetinaFace) with specialized demographic classifiers (FairFace and EMO-AffectNet) to analyze a random sample of 500,000 image URLs from ReLAION-2B-en, yielding over 37,000 faces. We analyze both general representational biases, revealing severe overrepresentation of certain groups, such as white people and individuals aged 20-29, and intersectional biases, notably the underrepresentation of women over 30 years old and non-White infants. These results highlight the importance of considering not just individual demographic attributes, but their intersections when evaluating and mitigating bias in large-scale datasets.
APA
Dominguez-Catena, I., Paternain, D., & Galar, M. (2025). Facial Demography Analysis of the LAION Dataset. Proceedings of Fourth European Workshop on Algorithmic Fairness, in Proceedings of Machine Learning Research 294:357-361. Available from https://proceedings.mlr.press/v294/dominguez-catena25a.html.
