Hashing2Vec: Fast Embedding Generation for SARS-CoV-2 Spike Sequence Classification
Proceedings of The 14th Asian Conference on Machine
Learning, PMLR 189:754-769, 2023.
Abstract
Due to the ongoing coronavirus (COVID-19) pandemic,
an unprecedented amount of SARS-CoV-2 sequence data
is available. The scale of this data has outpaced
traditional methods for its analysis, while
machine-learning approaches aimed at clustering and
classification of SARS-CoV-2 variants are becoming an
attractive alternative. Since the SARS-CoV-2 genome
is high-dimensional, considering the much smaller
spike region can save a great deal of processing.
As the spike protein mediates the attachment of the
coronavirus to the host cell, most of the newer and
more contagious variants can be characterized by
alterations to the spike protein; hence the spike
region is often sufficient for characterizing the
different SARS-CoV-2 variants. Another important
consideration is fast generation of feature embeddings,
which is the subject of this work. Applying any
machine learning (ML) model to a biological sequence
requires first transforming it into a fixed-length
(numerical) form. While there exist several compact
embeddings for SARS-CoV-2 spike protein sequences,
their generation is computationally expensive
because the features added to the resulting vectors
are indexed in a naïve fashion. To solve this
problem, we propose a fast and alignment-free
hashing-based approach to design a fixed-length
feature embedding for spike protein sequences,
called Hashing2Vec, which can be used as input to
any standard ML model. Using real-world data, we
show that the proposed embedding is not only
efficient to compute but also outperforms current
state-of-the-art embedding methods in terms of
classification accuracy. In terms of runtime,
Hashing2Vec-based embedding generation achieves up
to a 99.8% improvement over the baselines on a set
of 7K spike amino acid sequences. Hashing2Vec also
outperforms the baselines on this data in terms of
predictive performance, achieving accuracy and
ROC-AUC scores of 86% and 84.4%, respectively.
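
The abstract does not spell out the hashing scheme, so the following is only a minimal sketch of the general idea of a hashing-based, alignment-free, fixed-length embedding. It assumes k-mer (substring) features and an MD5-derived bucket index; the function names, the choice of k, the vector dimension, and the hash function are all hypothetical and are not taken from the paper. The point it illustrates is that each feature is placed directly at a hashed position, so no lookup over a vocabulary of all possible k-mers is needed.

```python
import hashlib
from typing import List


def kmer_bucket(kmer: str, dim: int) -> int:
    """Map a k-mer to a bucket in [0, dim) with a deterministic hash.

    Assumption: MD5 is used here only because it is stable across runs
    (Python's built-in hash() is salted per process for strings).
    """
    digest = hashlib.md5(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % dim


def hashed_kmer_embedding(sequence: str, k: int = 3, dim: int = 64) -> List[int]:
    """Return a fixed-length count vector for a sequence of any length.

    Hypothetical illustration, not the authors' implementation: every
    overlapping k-mer increments the bucket its hash points to, so the
    output length depends only on `dim`, never on the sequence length.
    """
    vector = [0] * dim
    for start in range(len(sequence) - k + 1):
        vector[kmer_bucket(sequence[start:start + k], dim)] += 1
    return vector


# Usage: two sequences of different lengths yield vectors of equal length.
short_vec = hashed_kmer_embedding("MFVFLVLLPLVSSQ", k=3, dim=16)
long_vec = hashed_kmer_embedding("MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNS", k=3, dim=16)
print(len(short_vec), len(long_vec))
```

Distinct k-mers can collide in the same bucket; a larger `dim` reduces collisions at the cost of a longer vector, which is the usual trade-off in feature hashing.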