Detecting Biomedical Named Entities in COVID-19 Texts

Shaina Raza, Brian Schwartz
Proceedings of the 1st Workshop on Healthcare AI and COVID-19, ICML 2022, PMLR 184:117-126, 2022.

Abstract

State-of-the-art biomedical named entity recognition methods face a few challenges in application: first, they are trained on a small number of clinical entity types (e.g., disease, symptom, proteins, genes); second, they require a large amount of data for pre-training and prediction, making them difficult to deploy in real-time scenarios; third, they do not consider non-clinical entities such as social determinants of health (age, gender, employment, race), which are also related to patients’ health. We propose a Machine Learning (ML) pipeline that improves on previous efforts in three ways: first, it recognizes many clinical entity types (diseases, symptoms, drugs, diagnosis, etc.); second, it is easily configurable, reusable, and can scale up for training and inference; third, it considers non-clinical factors related to patients’ health. At a high level, this pipeline consists of the following stages: pre-processing, tokenization, mapping embedding lookup, and named entity recognition. We also present a new dataset that we prepared by curating COVID-19 case reports. The proposed approach outperforms baseline methods on four benchmark datasets, with macro- and micro-average F1 scores around 90, and on our dataset, with macro- and micro-average F1 scores of 95.25 and 93.18, respectively.
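
The abstract reports both macro- and micro-average F1, which can differ noticeably when entity types are imbalanced. The sketch below illustrates the distinction on a toy example. It is not the authors' evaluation code: it scores at the token level for simplicity (NER is commonly scored at the entity-span level), and the function names and the labels `DISEASE`, `DRUG`, and `O` are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred, ignore="O"):
    """Token-level macro and micro F1 over all labels except `ignore`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1
        else:
            if p != ignore:
                fp[p] += 1  # predicted label p where it does not belong
            if g != ignore:
                fn[g] += 1  # missed an occurrence of gold label g
    labels = sorted(set(tp) | set(fp) | set(fn))
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    # Micro: pool the counts across labels first, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: one DISEASE token is mislabeled as DRUG.
gold = ["DISEASE", "DISEASE", "DISEASE", "DRUG", "O"]
pred = ["DISEASE", "DISEASE", "DRUG", "DRUG", "O"]
macro, micro = macro_micro_f1(gold, pred)
print(round(macro, 3), round(micro, 3))  # 0.733 0.75
```

Here the single error costs the rarer `DRUG` label more than the frequent `DISEASE` label, so the macro average falls below the micro average.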

Cite this Paper


BibTeX
@InProceedings{pmlr-v184-raza22a,
  title = {Detecting Biomedical Named Entities in COVID-19 Texts},
  author = {Raza, Shaina and Schwartz, Brian},
  booktitle = {Proceedings of the 1st Workshop on Healthcare AI and COVID-19, ICML 2022},
  pages = {117--126},
  year = {2022},
  editor = {Xu, Peng and Zhu, Tingting and Zhu, Pengkai and Clifton, David A. and Belgrave, Danielle and Zhang, Yuanting},
  volume = {184},
  series = {Proceedings of Machine Learning Research},
  month = {22 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v184/raza22a/raza22a.pdf},
  url = {https://proceedings.mlr.press/v184/raza22a.html},
  abstract = {The application of the state-of-the-art biomedical named entity recognition task faces a few challenges: first, these methods are trained on a fewer number of clinical entities (e.g., disease, symptom, proteins, genes); second, these methods require a large amount of data for pre-training and prediction, making it difficult to implement them in real-time scenarios; third, these methods do not consider the non-clinical entities such as social determinants of health (age, gender, employment, race) which are also related to patients’ health. We propose a Machine Learning (ML) pipeline that improves on previous efforts in three ways: first, it recognizes many clinical entity types (diseases, symptoms, drugs, diagnosis, etc.), second, this pipeline is easily configurable, reusable and can scale up for training and inference; third, it considers non-clinical factors related to patient’s health. At a high level, this pipeline consists of stages: pre-processing, tokenization, mapping embedding lookup and named entity recognition task. We also present a new dataset that we prepare by curating the COVID-19 case reports. The proposed approach outperforms baseline methods on four benchmark datasets with macro-and microaverage F1 scores around 90, as well as using our dataset with a macro-and micro-average F1 score of 95.25 and 93.18 respectively.}
}
Endnote
%0 Conference Paper
%T Detecting Biomedical Named Entities in COVID-19 Texts
%A Shaina Raza
%A Brian Schwartz
%B Proceedings of the 1st Workshop on Healthcare AI and COVID-19, ICML 2022
%C Proceedings of Machine Learning Research
%D 2022
%E Peng Xu
%E Tingting Zhu
%E Pengkai Zhu
%E David A. Clifton
%E Danielle Belgrave
%E Yuanting Zhang
%F pmlr-v184-raza22a
%I PMLR
%P 117--126
%U https://proceedings.mlr.press/v184/raza22a.html
%V 184
%X The application of the state-of-the-art biomedical named entity recognition task faces a few challenges: first, these methods are trained on a fewer number of clinical entities (e.g., disease, symptom, proteins, genes); second, these methods require a large amount of data for pre-training and prediction, making it difficult to implement them in real-time scenarios; third, these methods do not consider the non-clinical entities such as social determinants of health (age, gender, employment, race) which are also related to patients’ health. We propose a Machine Learning (ML) pipeline that improves on previous efforts in three ways: first, it recognizes many clinical entity types (diseases, symptoms, drugs, diagnosis, etc.), second, this pipeline is easily configurable, reusable and can scale up for training and inference; third, it considers non-clinical factors related to patient’s health. At a high level, this pipeline consists of stages: pre-processing, tokenization, mapping embedding lookup and named entity recognition task. We also present a new dataset that we prepare by curating the COVID-19 case reports. The proposed approach outperforms baseline methods on four benchmark datasets with macro-and microaverage F1 scores around 90, as well as using our dataset with a macro-and micro-average F1 score of 95.25 and 93.18 respectively.
APA
Raza, S. & Schwartz, B. (2022). Detecting Biomedical Named Entities in COVID-19 Texts. Proceedings of the 1st Workshop on Healthcare AI and COVID-19, ICML 2022, in Proceedings of Machine Learning Research 184:117-126. Available from https://proceedings.mlr.press/v184/raza22a.html.
