An Agentic Approach to Phenotype Mapping from Rare Disease Surveys

Jipeng Di, Julie Renee Vaughn, Joshua Proulx, Sadie Nordstrand, Bryce Daines, Katrisa Madeline Ward, Philip J. Lupo, Jianhong Hu, Mullai Murugan, Adam W. Hansen
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1253-1268, 2026.

Abstract

Rare disease patients worldwide often experience years-long diagnostic delays, in part due to fragmented and unstructured phenotypic information. Patient-reported surveys provide valuable insights but are typically unstructured and hard to integrate with structured data. We present GenOMA (Geneial Ontology Mapping Agent), a Large Language Model (LLM) agent built on the LangGraph framework and integrated with a Unified Medical Language System (UMLS) API for precise extraction and ontology mapping of phenotypic terms. Using a modular, node-based architecture for context-aware extraction, iterative refinement, candidate ranking, and semantic validation, GenOMA maps data to standardized Human Phenotype Ontology (HPO) codes without local ontology deployment. We evaluate GenOMA on the question fields of three rare disease surveys, mapping them to HPO terms, and compare its performance with other leading methods. On the Xia-Gibbs Syndrome (XGS) Registry, GenOMA achieved 0.92 accuracy, 0.94 precision, 0.97 recall, and 0.96 F1. On the Down Syndrome Phenotyping Acute Leukemia Study (DS-PALS) dataset, it obtained 0.92 accuracy, 0.93 precision, 0.98 recall, and 0.96 F1. Finally, on the GenomeConnect (GC) dataset, it obtained 0.91 accuracy, 0.91 precision, 1.0 recall, and 0.96 F1. In all tasks, GenOMA outperformed MetaMap, PhenoTagger, PhenoBERT, cTAKES, and GPT-5. These results show that GenOMA effectively converts unstructured survey data to structured phenotype information. To our knowledge, this is the first ontology mapping system specifically designed for patient-reported rare disease surveys, a critical but underexplored data modality.

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-di26a,
  title = {An Agentic Approach to Phenotype Mapping from Rare Disease Surveys},
  author = {Di, Jipeng and Vaughn, Julie Renee and Proulx, Joshua and Nordstrand, Sadie and Daines, Bryce and Ward, Katrisa Madeline and Lupo, Philip J. and Hu, Jianhong and Murugan, Mullai and Hansen, Adam W.},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages = {1253--1268},
  year = {2026},
  editor = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume = {297},
  series = {Proceedings of Machine Learning Research},
  month = {13--14 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/di26a/di26a.pdf},
  url = {https://proceedings.mlr.press/v297/di26a.html},
  abstract = {Rare disease patients worldwide often experience years-long diagnostic delays, in part due to fragmented and unstructured phenotypic information. Patient-reported surveys provide valuable insights but are typically unstructured and hard to integrate with structured data. We present GenOMA (Geneial Ontology Mapping Agent), a Large Language Model ({LLM}) agent built on the LangGraph framework and integrated with a Unified Medical Language System ({UMLS}) {API} for precise extraction and ontology mapping of phenotypic terms. Using a modular, node-based architecture for context-aware extraction, iterative refinement, candidate ranking, and semantic validation, GenOMA maps data to standardized Human Phenotype Ontology ({HPO}) codes without local ontology deployment. We evaluate GenOMA on the question fields of three rare disease surveys, mapping them to {HPO} terms, and compare its performance with other leading methods. On the Xia-Gibbs Syndrome ({XGS}) Registry, GenOMA achieved 0.92 accuracy, 0.94 precision, 0.97 recall, and 0.96 F1. On the Down Syndrome Phenotyping Acute Leukemia Study ({DS-PALS}) dataset, it obtained 0.92 accuracy, 0.93 precision, 0.98 recall, and 0.96 F1. Finally, on the GenomeConnect ({GC}) dataset, it obtained 0.91 accuracy, 0.91 precision, 1.0 recall, and 0.96 F1. In all tasks, GenOMA outperformed MetaMap, PhenoTagger, PhenoBERT, cTAKES, and {GPT-5}. These results show that GenOMA effectively converts unstructured survey data to structured phenotype information. To our knowledge, this is the first ontology mapping system specifically designed for patient-reported rare disease surveys, a critical but underexplored data modality.}
}
Endnote
%0 Conference Paper
%T An Agentic Approach to Phenotype Mapping from Rare Disease Surveys
%A Jipeng Di
%A Julie Renee Vaughn
%A Joshua Proulx
%A Sadie Nordstrand
%A Bryce Daines
%A Katrisa Madeline Ward
%A Philip J. Lupo
%A Jianhong Hu
%A Mullai Murugan
%A Adam W. Hansen
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-di26a
%I PMLR
%P 1253--1268
%U https://proceedings.mlr.press/v297/di26a.html
%V 297
%X Rare disease patients worldwide often experience years-long diagnostic delays, in part due to fragmented and unstructured phenotypic information. Patient-reported surveys provide valuable insights but are typically unstructured and hard to integrate with structured data. We present GenOMA (Geneial Ontology Mapping Agent), a Large Language Model (LLM) agent built on the LangGraph framework and integrated with a Unified Medical Language System (UMLS) API for precise extraction and ontology mapping of phenotypic terms. Using a modular, node-based architecture for context-aware extraction, iterative refinement, candidate ranking, and semantic validation, GenOMA maps data to standardized Human Phenotype Ontology (HPO) codes without local ontology deployment. We evaluate GenOMA on the question fields of three rare disease surveys, mapping them to HPO terms, and compare its performance with other leading methods. On the Xia-Gibbs Syndrome (XGS) Registry, GenOMA achieved 0.92 accuracy, 0.94 precision, 0.97 recall, and 0.96 F1. On the Down Syndrome Phenotyping Acute Leukemia Study (DS-PALS) dataset, it obtained 0.92 accuracy, 0.93 precision, 0.98 recall, and 0.96 F1. Finally, on the GenomeConnect (GC) dataset, it obtained 0.91 accuracy, 0.91 precision, 1.0 recall, and 0.96 F1. In all tasks, GenOMA outperformed MetaMap, PhenoTagger, PhenoBERT, cTAKES, and GPT-5. These results show that GenOMA effectively converts unstructured survey data to structured phenotype information. To our knowledge, this is the first ontology mapping system specifically designed for patient-reported rare disease surveys, a critical but underexplored data modality.
APA
Di, J., Vaughn, J.R., Proulx, J., Nordstrand, S., Daines, B., Ward, K.M., Lupo, P.J., Hu, J., Murugan, M. & Hansen, A.W. (2026). An Agentic Approach to Phenotype Mapping from Rare Disease Surveys. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1253-1268. Available from https://proceedings.mlr.press/v297/di26a.html.

Related Material