Evaluating Gender Bias in Large Language Models in Academic Knowledge Production

Judit Herman, Kira Diana Kovacs, Yaije Wang, Orsolya Vasarhelyi
Proceedings of Fourth European Workshop on Algorithmic Fairness, PMLR 294:417-422, 2025.

Abstract

Gender inequality in science is a complex issue that affects every stage of a scientific career, from education to professional advancement. Despite progress in recent decades, women remain underrepresented in most scientific fields, particularly in leadership roles and prestigious research positions. Generative AI holds promise for addressing long-standing inequalities in academia, such as assisting non-native English speakers in articulating their scientific discoveries more clearly and efficiently. Additionally, generative AI trained on scientific datasets could produce less biased and more comprehensive literature reviews. However, large language models (LLMs) have been shown to exhibit biases, failing to represent men and women equally in image generation. They also generate factually incorrect responses and fabricate non-existent references. In this paper, we examine references generated by the ChatGPT-4o model across 26 research areas within four main domains: Physical Sciences, Health Sciences, Social Sciences, and Life Sciences. Specifically, we designed a prompt that instructed ChatGPT to generate literature reviews on various research topics and to provide references including authors’ full names, article titles, journals, publication years, and DOIs. We then compared these references across research areas to OpenAlex, an open-source database containing over 250 million scientific publications. Our study shows that ChatGPT-4o tends to cite more recent publications and papers with a higher ratio of women authors. It also reveals a high hallucination rate in generated citations, with ChatGPT-4o showing no significant improvement over previous versions, underscoring the need for critical evaluation of AI-generated references. The 2024 AI Index Report published by Stanford University identified the lack of standardization in AI reporting, benchmarks, and evaluation frameworks as a key challenge in responsible AI development. Our ongoing research aims to develop a standardized framework for evaluating LLMs in academic knowledge production by systematically comparing AI-generated literature reviews with real databases of published work.
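
To make the comparison step concrete, below is a minimal sketch of one way to check a generated reference against OpenAlex, using its public works-by-DOI endpoint (documented at docs.openalex.org). This is an illustrative reconstruction, not the authors' pipeline: the function names, the title-matching heuristic, and the example DOI are our own additions.

import requests

# Look up a work in OpenAlex by DOI; the API returns 404 for DOIs it does
# not index, which flags a candidate hallucinated reference.
OPENALEX_WORKS = "https://api.openalex.org/works/https://doi.org/{doi}"

def lookup_doi(doi: str):
    """Return the OpenAlex record for a DOI, or None if it is not indexed."""
    resp = requests.get(OPENALEX_WORKS.format(doi=doi.strip()))
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()

def check_reference(doi: str, title: str) -> str:
    """Classify a generated (DOI, title) pair as verified, mismatched, or hallucinated."""
    work = lookup_doi(doi)
    if work is None:
        return "hallucinated DOI"
    # A DOI can exist yet belong to a different paper than the one cited,
    # so also compare the generated title with the indexed title.
    indexed = (work.get("title") or "").lower()
    generated = title.lower()
    if generated in indexed or indexed in generated:
        return "verified"
    return "DOI/title mismatch"

# Example with a real publication (the NumPy paper):
print(check_reference("10.1038/s41586-020-2649-2", "Array programming with NumPy"))

A 404 from the works endpoint catches fabricated DOIs, while the title comparison catches the subtler failure mode in which a real identifier is attached to the wrong paper; the authors' actual matching procedure may of course differ.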

Cite this Paper


BibTeX
@InProceedings{pmlr-v294-herman25a,
  title     = {Evaluating Gender Bias in Large Language Models in Academic Knowledge Production},
  author    = {Herman, Judit and Kovacs, Kira Diana and Wang, Yaije and Vasarhelyi, Orsolya},
  booktitle = {Proceedings of Fourth European Workshop on Algorithmic Fairness},
  pages     = {417--422},
  year      = {2025},
  editor    = {Weerts, Hilde and Pechenizkiy, Mykola and Allhutter, Doris and Corrêa, Ana Maria and Grote, Thomas and Liem, Cynthia},
  volume    = {294},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--02 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v294/main/assets/herman25a/herman25a.pdf},
  url       = {https://proceedings.mlr.press/v294/herman25a.html}
}
Endnote
%0 Conference Paper
%T Evaluating Gender Bias in Large Language Models in Academic Knowledge Production
%A Judit Herman
%A Kira Diana Kovacs
%A Yaije Wang
%A Orsolya Vasarhelyi
%B Proceedings of Fourth European Workshop on Algorithmic Fairness
%C Proceedings of Machine Learning Research
%D 2025
%E Hilde Weerts
%E Mykola Pechenizkiy
%E Doris Allhutter
%E Ana Maria Corrêa
%E Thomas Grote
%E Cynthia Liem
%F pmlr-v294-herman25a
%I PMLR
%P 417--422
%U https://proceedings.mlr.press/v294/herman25a.html
%V 294
APA
Herman, J., Kovacs, K.D., Wang, Y. & Vasarhelyi, O. (2025). Evaluating Gender Bias in Large Language Models in Academic Knowledge Production. Proceedings of Fourth European Workshop on Algorithmic Fairness, in Proceedings of Machine Learning Research 294:417-422. Available from https://proceedings.mlr.press/v294/herman25a.html.
