An Agentic Approach to Phenotype Mapping from Rare Disease Surveys
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1253-1268, 2026.
Abstract
Rare disease patients worldwide often experience years-long diagnostic delays, in part due to fragmented and unstructured phenotypic information. Patient-reported surveys provide valuable insights but are typically unstructured and hard to integrate with structured data. We present GenOMA (Geneial Ontology Mapping Agent), a Large Language Model (LLM) agent built on the LangGraph framework and integrated with a Unified Medical Language System (UMLS) API for precise extraction and ontology mapping of phenotypic terms. Using a modular, node-based architecture for context-aware extraction, iterative refinement, candidate ranking, and semantic validation, GenOMA maps data to standardized Human Phenotype Ontology (HPO) codes without local ontology deployment. We evaluate GenOMA on the question fields of three rare disease surveys, mapping them to HPO terms, and compare its performance with other leading methods. On the Xia-Gibbs Syndrome (XGS) Registry, GenOMA achieved 0.92 accuracy, 0.94 precision, 0.97 recall, and 0.96 F1. On the Down Syndrome Phenotyping Acute Leukemia Study (DS-PALS) dataset, it obtained 0.92 accuracy, 0.93 precision, 0.98 recall, and 0.96 F1. Finally, on the GenomeConnect (GC) dataset, it obtained 0.91 accuracy, 0.91 precision, 1.0 recall, and 0.96 F1. In all tasks, GenOMA outperformed MetaMap, PhenoTagger, PhenoBERT, cTAKES, and GPT-5. These results show that GenOMA effectively converts unstructured survey data to structured phenotype information. To our knowledge, this is the first ontology mapping system specifically designed for patient-reported rare disease surveys, a critical but underexplored data modality.
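As a rough illustration of the node-based flow the abstract describes (extraction, candidate ranking, validation), the following is a minimal sketch and not GenOMA's implementation. It assumes a toy in-memory dictionary in place of the UMLS API, plain string similarity in place of the LLM, and it omits iterative refinement. All function names, the question prefixes, and the 0.6 threshold are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional

# Toy stand-in for an ontology lookup (assumption: a real system would
# query the UMLS API rather than a local dictionary).
TOY_HPO = {
    "HP:0001250": "Seizure",
    "HP:0001263": "Global developmental delay",
    "HP:0000252": "Microcephaly",
}

# Hypothetical survey phrasings to strip before matching.
QUESTION_PREFIXES = (
    "has your child ever had ",
    "does your child have ",
)


@dataclass
class Candidate:
    code: str
    label: str
    score: float


def extract_term(question: str) -> str:
    """Extraction node: reduce a survey question to its phenotype phrase."""
    text = question.lower().rstrip("?").strip()
    for prefix in QUESTION_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return text


def rank_candidates(term: str) -> List[Candidate]:
    """Ranking node: score every ontology label against the extracted term."""
    scored = [
        Candidate(code, label, SequenceMatcher(None, term, label.lower()).ratio())
        for code, label in TOY_HPO.items()
    ]
    return sorted(scored, key=lambda c: c.score, reverse=True)


def validate(best: Candidate, threshold: float = 0.6) -> Optional[Candidate]:
    """Validation node: reject the top candidate if it matches too weakly."""
    return best if best.score >= threshold else None


def map_question(question: str) -> Optional[Candidate]:
    """Run the nodes in sequence for a single survey question."""
    term = extract_term(question)
    ranked = rank_candidates(term)
    return validate(ranked[0])


if __name__ == "__main__":
    result = map_question("Has your child ever had seizures?")
    print(result.code if result else "no confident mapping")
```

Keeping each stage as a separate function mirrors the modular design: any one node, for example the ranking step, can be swapped for an LLM call or an API query without touching the others.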