Automated LOINC Standardization Using Pre-trained Large Language Models

Tao Tu, Eric Loreaux, Emma Chesley, Adam D. Lelkes, Paul Gamble, Mathias Bellaiche, Martin Seneviratne, Ming-Jun Chen
Proceedings of the 2nd Machine Learning for Health symposium, PMLR 193:343-355, 2022.

Abstract

Harmonization of local source concepts to standard clinical terminologies is a prerequisite for multi-center data aggregation and sharing. Challenges in automating the mapping process stem from the idiosyncratic source encoding schemes adopted by different health systems and the lack of large publicly available training data. In this study, we aim to develop a scalable and generalizable machine learning tool to facilitate standardizing laboratory observations to the Logical Observation Identifiers Names and Codes (LOINC). Specifically, we leverage the contextual embedding from pre-trained T5 models and propose a two-stage fine-tuning strategy based on contrastive learning to enable learning in a few-shot setting without manual feature engineering. Our method utilizes the unlabeled general LOINC ontology and data augmentation to achieve high accuracy on retrieving the most relevant LOINC targets when only a limited amount of labeled data is available. We further show that our model generalizes well to unseen targets. Taken together, our approach shows great potential to reduce manual effort in LOINC standardization and can be easily extended to mapping other terminologies.
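
The abstract describes retrieving the most relevant LOINC targets from learned embeddings. As a rough illustration only, and not the authors' implementation, the sketch below ranks candidate targets by cosine similarity to a source embedding. Several things here are assumptions: the use of cosine similarity as the scoring function, the function names `cosine` and `top_k_targets`, the target labels, and the toy 3-dimensional vectors, which merely stand in for embeddings that a fine-tuned model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_targets(source_vec, target_vecs, k=3):
    """Return the k targets most similar to the source, best first."""
    scored = [(name, cosine(source_vec, vec)) for name, vec in target_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical target labels with made-up vectors (not real model output).
targets = {
    "target_glucose_serum": [0.9, 0.1, 0.0],
    "target_hemoglobin_blood": [0.0, 1.0, 0.1],
    "target_sodium_serum": [0.1, 0.0, 0.9],
}

# Hypothetical embedding of a local laboratory label.
source = [0.8, 0.2, 0.1]

ranked = top_k_targets(source, targets, k=2)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In a setup like this, generalizing to unseen targets would only require adding their embeddings to the candidate dictionary, since ranking does not depend on a fixed output layer.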

Cite this Paper


BibTeX
@InProceedings{pmlr-v193-tu22a,
  title = {Automated LOINC Standardization Using Pre-trained Large Language Models},
  author = {Tu, Tao and Loreaux, Eric and Chesley, Emma and Lelkes, Adam D. and Gamble, Paul and Bellaiche, Mathias and Seneviratne, Martin and Chen, Ming-Jun},
  booktitle = {Proceedings of the 2nd Machine Learning for Health symposium},
  pages = {343--355},
  year = {2022},
  editor = {Parziale, Antonio and Agrawal, Monica and Joshi, Shalmali and Chen, Irene Y. and Tang, Shengpu and Oala, Luis and Subbaswamy, Adarsh},
  volume = {193},
  series = {Proceedings of Machine Learning Research},
  month = {28 Nov},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v193/tu22a/tu22a.pdf},
  url = {https://proceedings.mlr.press/v193/tu22a.html},
  abstract = {Harmonization of local source concepts to standard clinical terminologies is a prerequisite for multi-center data aggregation and sharing. Challenges in automating the mapping process stem from the idiosyncratic source encoding schemes adopted by different health systems and the lack of large publicly available training data. In this study, we aim to develop a scalable and generalizable machine learning tool to facilitate standardizing laboratory observations to the Logical Observation Identifiers Names and Codes (LOINC). Specifically, we leverage the contextual embedding from pre-trained T5 models and propose a two-stage fine-tuning strategy based on contrastive learning to enable learning in a few-shot setting without manual feature engineering. Our method utilizes unlabeled general LOINC ontology and data augmentation to achieve high accuracy on retrieving the most relevant LOINC targets when limited amount of labeled data are available. We further show that our model generalizes well to unseen targets. Taken together, our approach shows great potential to reduce manual effort in LOINC standardization and can be easily extended to mapping other terminologies.}
}
Endnote
%0 Conference Paper
%T Automated LOINC Standardization Using Pre-trained Large Language Models
%A Tao Tu
%A Eric Loreaux
%A Emma Chesley
%A Adam D. Lelkes
%A Paul Gamble
%A Mathias Bellaiche
%A Martin Seneviratne
%A Ming-Jun Chen
%B Proceedings of the 2nd Machine Learning for Health symposium
%C Proceedings of Machine Learning Research
%D 2022
%E Antonio Parziale
%E Monica Agrawal
%E Shalmali Joshi
%E Irene Y. Chen
%E Shengpu Tang
%E Luis Oala
%E Adarsh Subbaswamy
%F pmlr-v193-tu22a
%I PMLR
%P 343--355
%U https://proceedings.mlr.press/v193/tu22a.html
%V 193
%X Harmonization of local source concepts to standard clinical terminologies is a prerequisite for multi-center data aggregation and sharing. Challenges in automating the mapping process stem from the idiosyncratic source encoding schemes adopted by different health systems and the lack of large publicly available training data. In this study, we aim to develop a scalable and generalizable machine learning tool to facilitate standardizing laboratory observations to the Logical Observation Identifiers Names and Codes (LOINC). Specifically, we leverage the contextual embedding from pre-trained T5 models and propose a two-stage fine-tuning strategy based on contrastive learning to enable learning in a few-shot setting without manual feature engineering. Our method utilizes unlabeled general LOINC ontology and data augmentation to achieve high accuracy on retrieving the most relevant LOINC targets when limited amount of labeled data are available. We further show that our model generalizes well to unseen targets. Taken together, our approach shows great potential to reduce manual effort in LOINC standardization and can be easily extended to mapping other terminologies.
APA
Tu, T., Loreaux, E., Chesley, E., Lelkes, A.D., Gamble, P., Bellaiche, M., Seneviratne, M. & Chen, M. (2022). Automated LOINC Standardization Using Pre-trained Large Language Models. Proceedings of the 2nd Machine Learning for Health symposium, in Proceedings of Machine Learning Research 193:343-355. Available from https://proceedings.mlr.press/v193/tu22a.html.
