Hashing2Vec: Fast Embedding Generation for SARS-CoV-2 Spike Sequence Classification

Murad Taslim, Chourasia Prakash, Ali Sarwan, Patterson Murray
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:754-769, 2023.

Abstract

Due to the ongoing coronavirus (COVID-19) pandemic, an unprecedented amount of SARS-CoV-2 sequence data is available. The scale of this data has outpaced traditional methods for its analysis, and machine-learning approaches aimed at clustering and classification of SARS-CoV-2 variants are becoming an attractive alternative. Since the SARS-CoV-2 genome is high-dimensional, considering the much smaller spike region can save a great deal of processing. As the spike protein mediates the attachment of the coronavirus to the host cell, most of the newer and more contagious variants can be characterized by alterations to the spike protein; hence the spike region is often sufficient for characterizing the different SARS-CoV-2 variants. Another important consideration is fast generation of feature embeddings, which is the subject of this work. Applying any machine learning (ML) model to a biological sequence requires first transforming it into a fixed-length numerical form. While several compact embeddings exist for SARS-CoV-2 spike protein sequences, generating them is computationally expensive because the features added to the resulting vectors are indexed in a naïve fashion. To solve this problem, we propose a fast, alignment-free, hashing-based approach to design a fixed-length feature embedding for spike protein sequences, called Hashing2Vec, which can be used as input to any standard ML model. Using real-world data, we show that the proposed embedding is not only efficient to compute but also outperforms current state-of-the-art embedding methods in terms of classification accuracy. In terms of runtime, we achieve up to a 99.8% improvement in Hashing2Vec-based embedding generation compared to the baselines on a set of 7K spike amino acid sequences. It also outperforms the baselines on this data in terms of predictive performance, achieving accuracy and ROC-AUC scores of 86% and 84.4%, respectively.
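To make the idea of a hashing-based, fixed-length embedding concrete, here is a minimal sketch of generic feature hashing over k-mers. It is not the Hashing2Vec algorithm itself: the abstract does not specify the hash function, the k-mer length, or the embedding dimension, so the function names (`kmers`, `hashed_embedding`), the MD5-based bucket choice, and the defaults `k=3` and `dim=1024` are all assumptions made for illustration.

```python
import hashlib


def kmers(sequence, k=3):
    """Return all overlapping substrings of length k (assumed k-mer features)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def hashed_embedding(sequence, k=3, dim=1024):
    """Map a sequence to a fixed-length count vector via feature hashing.

    Each k-mer is hashed directly to a bucket index, so no lookup over a
    vocabulary of all possible k-mers is needed. The hash function, k, and
    dim are illustrative assumptions, not values taken from the paper.
    """
    vector = [0] * dim
    for kmer in kmers(sequence, k):
        digest = hashlib.md5(kmer.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as a stable integer hash.
        index = int.from_bytes(digest[:8], "little") % dim
        vector[index] += 1
    return vector


# Usage: the first 14 residues of a spike amino acid sequence.
embedding = hashed_embedding("MFVFLVLLPLVSSQ", k=3, dim=64)
print(len(embedding))  # 64
print(sum(embedding))  # 12, one count per overlapping 3-mer
```

Because every k-mer goes straight to a bucket, the output length is always `dim` regardless of sequence length, which is the fixed-length property a standard ML model needs. Distinct k-mers can collide in the same bucket; a larger `dim` reduces collisions at the cost of a longer vector.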

Cite this Paper


BibTeX
@InProceedings{pmlr-v189-taslim23a,
  title = {Hashing2Vec: Fast Embedding Generation for SARS-CoV-2 Spike Sequence Classification},
  author = {Taslim, Murad and Prakash, Chourasia and Sarwan, Ali and Murray, Patterson},
  booktitle = {Proceedings of The 14th Asian Conference on Machine Learning},
  pages = {754--769},
  year = {2023},
  editor = {Khan, Emtiyaz and Gonen, Mehmet},
  volume = {189},
  series = {Proceedings of Machine Learning Research},
  month = {12--14 Dec},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v189/taslim23a/taslim23a.pdf},
  url = {https://proceedings.mlr.press/v189/taslim23a.html},
  abstract = {Due to the ongoing coronavirus (COVID-19) pandemic, an unprecedented amount of SARS-CoV-2 sequence data is available. The scale of this data has out-paced traditional methods for its analysis, while machine-learning approaches aimed at clustering and classification of SARS-CoV-2 variants is becoming an attractive alternative. Since the SARS-CoV-2 genome is highly dimensional, considering the much smaller spike region can save a great deal of processing. As the spike protein mediates the attachment of the coronavirus to the host cell, most of the newer and more contagious variants can be characterized by alterations to the spike protein; hence it is often sufficient for characterizing the different SARS-CoV-2 variants. Another important consideration is to have a fast feature embedding generation, which is the subject of this work. Applying any machine learning (ML) model to a biological sequence requires first transforming it into a fixed-length (numerical) form. While there exist several compact embeddings for SARS-CoV-2 spike protein sequences, the generation process is computationally expensive since the features, added to the resulting vectors, are indexed in a naïve fashion. To solve this problem, we propose a fast and alignment-free hashing-based approach to design a fixed-length feature embedding for spike protein sequences, called Hashing2Vec, which can be used as input to any standard ML model. Using real-world data, we show that the proposed embedding is not only efficient to compute but also outperforms current state-of-the-art embedding methods in terms of classification accuracy. In terms of runtime, we achieve up to a 99.8% improvement in the Hashing2Vec-based embedding generation as compared to the baselines on a set of 7K spike amino acid sequences. It also outperforms the baselines on this data in terms of predictive performance and achieves accuracy and ROC-AUC scores of 86% and 84.4%, respectively.}
}
Endnote
%0 Conference Paper
%T Hashing2Vec: Fast Embedding Generation for SARS-CoV-2 Spike Sequence Classification
%A Murad Taslim
%A Chourasia Prakash
%A Ali Sarwan
%A Patterson Murray
%B Proceedings of The 14th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-taslim23a
%I PMLR
%P 754--769
%U https://proceedings.mlr.press/v189/taslim23a.html
%V 189
%X Due to the ongoing coronavirus (COVID-19) pandemic, an unprecedented amount of SARS-CoV-2 sequence data is available. The scale of this data has out-paced traditional methods for its analysis, while machine-learning approaches aimed at clustering and classification of SARS-CoV-2 variants is becoming an attractive alternative. Since the SARS-CoV-2 genome is highly dimensional, considering the much smaller spike region can save a great deal of processing. As the spike protein mediates the attachment of the coronavirus to the host cell, most of the newer and more contagious variants can be characterized by alterations to the spike protein; hence it is often sufficient for characterizing the different SARS-CoV-2 variants. Another important consideration is to have a fast feature embedding generation, which is the subject of this work. Applying any machine learning (ML) model to a biological sequence requires first transforming it into a fixed-length (numerical) form. While there exist several compact embeddings for SARS-CoV-2 spike protein sequences, the generation process is computationally expensive since the features, added to the resulting vectors, are indexed in a naïve fashion. To solve this problem, we propose a fast and alignment-free hashing-based approach to design a fixed-length feature embedding for spike protein sequences, called Hashing2Vec, which can be used as input to any standard ML model. Using real-world data, we show that the proposed embedding is not only efficient to compute but also outperforms current state-of-the-art embedding methods in terms of classification accuracy. In terms of runtime, we achieve up to a 99.8% improvement in the Hashing2Vec-based embedding generation as compared to the baselines on a set of 7K spike amino acid sequences. It also outperforms the baselines on this data in terms of predictive performance and achieves accuracy and ROC-AUC scores of 86% and 84.4%, respectively.
APA
Taslim, M., Prakash, C., Sarwan, A. & Murray, P. (2023). Hashing2Vec: Fast Embedding Generation for SARS-CoV-2 Spike Sequence Classification. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:754-769. Available from https://proceedings.mlr.press/v189/taslim23a.html.
