Evaluating Gender Bias in Large Language Models in Academic Knowledge Production
Proceedings of Fourth European Workshop on Algorithmic Fairness, PMLR 294:417-422, 2025.
Abstract
Gender inequality in science is a complex issue that affects every stage of a scientific career, from education to professional advancement. Despite progress in recent decades, women remain underrepresented in most scientific fields, particularly in leadership roles and prestigious research positions. Generative AI holds promise for addressing long-standing inequalities in academia, such as helping non-native English speakers articulate their scientific discoveries more clearly and efficiently. Generative AI trained on scientific datasets could also produce less biased and more comprehensive literature reviews. However, large language models (LLMs) have been shown to exhibit biases, for example failing to represent men and women equally in image generation. They also generate factually incorrect responses and fabricate non-existent references. In this paper, we examine references generated by the ChatGPT-4o model across 26 research areas within four main domains: Physical Sciences, Health Sciences, Social Sciences, and Life Sciences. Specifically, we designed a prompt instructing ChatGPT to generate literature reviews on various research topics and to provide references including authors’ full names, article titles, journals, publication years, and DOIs. We then compared these references, across research areas, against OpenAlex, an open-source database containing over 250 million scientific publications. Our study shows that ChatGPT-4o tends to cite more recent publications and papers with a higher ratio of women authors. It also reveals a high hallucination rate in generated citations, with ChatGPT-4o showing no significant improvement over previous versions, underscoring the need for critical evaluation of AI-generated references. The 2024 AI Index Report published by Stanford University identified the lack of standardization in AI reporting, benchmarks, and evaluation frameworks as a key challenge in responsible AI development. Our ongoing research aims to develop a standardized framework for evaluating LLMs in academic knowledge production by systematically comparing AI-generated literature reviews with real databases of published work.
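To make the comparison step concrete, the sketch below shows one way to check a DOI from a generated citation against the public OpenAlex REST API. It is a minimal illustration, not the authors' actual pipeline: the function name verify_doi and the selected metadata fields are assumptions for the example.

```python
import requests

# OpenAlex resolves works by DOI via its public works endpoint.
OPENALEX_WORKS = "https://api.openalex.org/works/doi:{doi}"

def verify_doi(doi: str) -> dict | None:
    """Look up a (possibly hallucinated) DOI in OpenAlex.

    Returns basic metadata if the DOI resolves to a real publication,
    or None if OpenAlex has no record of it.
    """
    resp = requests.get(OPENALEX_WORKS.format(doi=doi.strip().lower()))
    if resp.status_code == 404:
        return None  # no such work: likely a fabricated reference
    resp.raise_for_status()
    work = resp.json()
    return {
        "title": work.get("title"),
        "year": work.get("publication_year"),
        "authors": [
            a["author"]["display_name"] for a in work.get("authorships", [])
        ],
    }

# Example: a DOI extracted from a ChatGPT-generated reference list.
print(verify_doi("10.1038/s41586-020-2649-2"))
```

Retrieved author names and publication years can then be matched against the generated citation to score both hallucination rate and author-gender composition.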