PhenoRAG: Retrieval-Augmented Generation for Efficient Zero-Shot Phenotype Identification in Clinical Reports
Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025.
Abstract
Accurate extraction of phenotypic information from clinical narratives is essential in diagnostic medicine, yet mapping free-text reports to structured Human Phenotype Ontology (HPO) terms remains challenging. While encoder-only transformer models and small decoder-only generative models are attractive for clinical deployment due to their efficiency and low resource requirements, the former often fail to capture the rich context of clinical texts, and the latter struggle to process lengthy reports effectively. In contrast, larger language models excel at contextual understanding but are impractical for clinical use due to their size, propensity to hallucinate, and the privacy concerns associated with non-local inference. To overcome these challenges, we introduce PhenoRAG, a novel retrieval-augmented generation framework that leverages a synthetic database of contextually enriched sentences to augment a lightweight decoder-only model for accurate zero-shot phenotype identification. We demonstrate the capacity of PhenoRAG to capture nuanced contextual clues by 1) evaluating it on two clinically relevant tasks, guiding rare disease diagnosis and facilitating urinary tract infection detection, and 2) validating its performance on a synthetic dataset designed to mimic the challenges of real clinical narratives. Experimental results show that our lightweight PhenoRAG framework achieves a higher F1-score than both encoder-only transformers and standalone small language models, driven primarily by its high recall. These findings underscore the potential of PhenoRAG as a ready-to-use clinical tool for phenotype identification.
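To make the F1 and recall claim concrete, the following is a minimal sketch of how set-based precision, recall, and F1 could be computed for the HPO terms predicted on a single report. It is an illustration only and not the paper's evaluation code: the function name `prf1`, the per-report set-based scoring, and the example identifiers are all assumptions made for this sketch.

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for one report (hypothetical helper)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative HPO-style identifiers; not taken from the paper's data.
gold = {"HP:0001250", "HP:0001263", "HP:0000252", "HP:0001290"}

# A system that finds every gold term but adds one spurious term.
high_recall = gold | {"HP:0000365"}

# A system that is always right but finds only half of the gold terms.
high_precision = {"HP:0001250", "HP:0001263"}

for name, preds in [("high recall", high_recall), ("high precision", high_precision)]:
    p, r, f = prf1(preds, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case the high-recall system scores a higher F1 than the high-precision one, because F1 is the harmonic mean of the two and drops sharply when either is low. This is one way a method can achieve a higher F1-score primarily through high recall.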