Hashing2Vec: Fast Embedding Generation for SARS-CoV-2 Spike Sequence Classification

Murad Taslim; Chourasia Prakash; Ali Sarwan; Patterson Murray

Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:754-769, 2023.

Abstract

Due to the ongoing coronavirus (COVID-19) pandemic, an unprecedented amount of SARS-CoV-2 sequence data is available. The scale of this data has out-paced traditional methods for its analysis, while machine-learning approaches aimed at clustering and classification of SARS-CoV-2 variants are becoming an attractive alternative. Since the SARS-CoV-2 genome is high-dimensional, considering the much smaller spike region can save a great deal of processing. As the spike protein mediates the attachment of the coronavirus to the host cell, most of the newer and more contagious variants can be characterized by alterations to the spike protein; hence it is often sufficient for characterizing the different SARS-CoV-2 variants. Another important consideration is fast generation of feature embeddings, which is the subject of this work. Applying any machine learning (ML) model to a biological sequence requires first transforming it into a fixed-length (numerical) form. While there exist several compact embeddings for SARS-CoV-2 spike protein sequences, the generation process is computationally expensive since the features, added to the resulting vectors, are indexed in a naïve fashion. To solve this problem, we propose a fast and alignment-free hashing-based approach to design a fixed-length feature embedding for spike protein sequences, called Hashing2Vec, which can be used as input to any standard ML model. Using real-world data, we show that the proposed embedding is not only efficient to compute but also outperforms current state-of-the-art embedding methods in terms of classification accuracy. In terms of runtime, we achieve up to a 99.8% improvement in the Hashing2Vec-based embedding generation as compared to the baselines on a set of 7K spike amino acid sequences. It also outperforms the baselines on this data in terms of predictive performance and achieves accuracy and ROC-AUC scores of 86% and 84.4%, respectively.
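The abstract does not spell out the hashing scheme, so the following is only a minimal sketch of the general idea of a hashing-based, alignment-free, fixed-length embedding, not the authors' Hashing2Vec implementation. It assumes k-mers as features, MD5 as the hash function, and a modulo bucket assignment; the function name `kmer_hash_embedding` and the parameters `k` and `dim` are hypothetical and chosen for illustration.

```python
import hashlib
from typing import List


def kmer_hash_embedding(sequence: str, k: int = 3, dim: int = 64) -> List[int]:
    """Count the k-mers of `sequence` into a vector of length `dim`.

    Illustrative only: k-mers, MD5, and modulo bucketing are assumptions,
    not details taken from the paper.
    """
    vec = [0] * dim
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Use a stable hash; Python's built-in hash() is salted per process
        # for strings, so it would not give reproducible buckets.
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        # The hash gives the index directly, so no k-mer lookup table is needed.
        vec[bucket] += 1
    return vec


# A short, made-up amino acid string used purely as example input.
emb = kmer_hash_embedding("MFVFLVLLPLVSSQCVNLT", k=3, dim=16)
```

Because the bucket index comes straight from the hash, sequences of any length map to vectors of the same size without alignment and without maintaining an explicit feature index. The trade-off is that distinct k-mers can collide in one bucket, more often as `dim` shrinks.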

Cite this Paper

BibTeX


@InProceedings{pmlr-v189-taslim23a,
  title = 	 {Hashing2Vec: Fast Embedding Generation for
 SARS-CoV-2 Spike Sequence Classification},
  author =       {Taslim, Murad and Prakash, Chourasia and Sarwan, Ali and Murray, Patterson},
  booktitle = 	 {Proceedings of The 14th Asian Conference on Machine
 Learning},
  pages = 	 {754--769},
  year = 	 {2023},
  editor = 	 {Khan, Emtiyaz and Gonen, Mehmet},
  volume = 	 {189},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {12--14 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v189/taslim23a/taslim23a.pdf},
  url = 	 {https://proceedings.mlr.press/v189/taslim23a.html},
  abstract = 	 { Due to the ongoing coronavirus (COVID-19) pandemic,
 an unprecedented amount of SARS-CoV-2 sequence data
 is available. The scale of this data has out-paced
 traditional methods for its analysis, while
 machine-learning approaches aimed at clustering and
 classification of SARS-CoV-2 variants are becoming an
 attractive alternative. Since the SARS-CoV-2 genome
 is high-dimensional, considering the much smaller
 spike region can save a great deal of processing.
 As the spike protein mediates the attachment of the
 coronavirus to the host cell, most of the newer and
 more contagious variants can be characterized by
 alterations to the spike protein; hence it is often
 sufficient for characterizing the different
 SARS-CoV-2 variants. Another important consideration
 is fast generation of feature embeddings,
 which is the subject of this work.  Applying any
 machine learning (ML) model to a biological sequence
 requires first transforming it into a fixed-length
 (numerical) form. While there exist several compact
 embeddings for SARS-CoV-2 spike protein sequences,
 the generation process is computationally expensive
 since the features, added to the resulting vectors,
 are indexed in a naïve fashion.  To solve this
 problem, we propose a fast and alignment-free
 hashing-based approach to design a fixed-length
 feature embedding for spike protein sequences,
 called Hashing2Vec, which can be used as input to
 any standard ML model. Using real-world data, we
 show that the proposed embedding is not only
 efficient to compute but also outperforms current
 state-of-the-art embedding methods in terms of
 classification accuracy. In terms of runtime, we
 achieve up to a 99.8% improvement in the
 Hashing2Vec-based embedding generation as compared
 to the baselines on a set of 7K spike amino acid
 sequences. It also outperforms the baselines on this
 data in terms of predictive performance and achieves
 accuracy and ROC-AUC scores of 86% and 84.4%,
 respectively.}
}

Endnote

%0 Conference Paper
%T Hashing2Vec: Fast Embedding Generation for
 SARS-CoV-2 Spike Sequence Classification
%A Murad Taslim
%A Chourasia Prakash
%A Ali Sarwan
%A Patterson Murray
%B Proceedings of The 14th Asian Conference on Machine
 Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-taslim23a
%I PMLR
%P 754--769
%U https://proceedings.mlr.press/v189/taslim23a.html
%V 189
%X  Due to the ongoing coronavirus (COVID-19) pandemic,
 an unprecedented amount of SARS-CoV-2 sequence data
 is available. The scale of this data has out-paced
 traditional methods for its analysis, while
 machine-learning approaches aimed at clustering and
 classification of SARS-CoV-2 variants are becoming an
 attractive alternative. Since the SARS-CoV-2 genome
 is high-dimensional, considering the much smaller
 spike region can save a great deal of processing.
 As the spike protein mediates the attachment of the
 coronavirus to the host cell, most of the newer and
 more contagious variants can be characterized by
 alterations to the spike protein; hence it is often
 sufficient for characterizing the different
 SARS-CoV-2 variants. Another important consideration
 is fast generation of feature embeddings,
 which is the subject of this work.  Applying any
 machine learning (ML) model to a biological sequence
 requires first transforming it into a fixed-length
 (numerical) form. While there exist several compact
 embeddings for SARS-CoV-2 spike protein sequences,
 the generation process is computationally expensive
 since the features, added to the resulting vectors,
 are indexed in a naïve fashion.  To solve this
 problem, we propose a fast and alignment-free
 hashing-based approach to design a fixed-length
 feature embedding for spike protein sequences,
 called Hashing2Vec, which can be used as input to
 any standard ML model. Using real-world data, we
 show that the proposed embedding is not only
 efficient to compute but also outperforms current
 state-of-the-art embedding methods in terms of
 classification accuracy. In terms of runtime, we
 achieve up to a 99.8% improvement in the
 Hashing2Vec-based embedding generation as compared
 to the baselines on a set of 7K spike amino acid
 sequences. It also outperforms the baselines on this
 data in terms of predictive performance and achieves
 accuracy and ROC-AUC scores of 86% and 84.4%,
 respectively.

APA


Taslim, M., Prakash, C., Sarwan, A. & Murray, P. (2023). Hashing2Vec: Fast Embedding Generation for SARS-CoV-2 Spike Sequence Classification. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:754-769. Available from https://proceedings.mlr.press/v189/taslim23a.html.
