Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage

Alan J.X. Guo, Cong Liang, Qing-Hu Hou
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:8095-8108, 2022.

Abstract

Storing information in DNA molecules is of great interest because of its advantages in longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarities. Levenshtein distance is the most suitable metric on the similarity between two DNA sequences, but it is inferior in terms of computational complexity and less compatible with mature clustering algorithms. In this work, we propose a novel deep squared Euclidean embedding for DNA sequences using Siamese neural network, squared Euclidean embedding, and chi-squared regression. The Levenshtein distance is approximated by the squared Euclidean distance between the embedding vectors, which is fast calculated and clustering algorithm friendly. The proposed approach is analyzed theoretically and experimentally. The results show that the proposed embedding is efficient and robust.

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-guo22f, title = {Deep Squared {E}uclidean Approximation to the Levenshtein Distance for {DNA} Storage}, author = {Guo, Alan J.X. and Liang, Cong and Hou, Qing-Hu}, booktitle = {Proceedings of the 39th International Conference on Machine Learning}, pages = {8095--8108}, year = {2022}, editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan}, volume = {162}, series = {Proceedings of Machine Learning Research}, month = {17--23 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v162/guo22f/guo22f.pdf}, url = {https://proceedings.mlr.press/v162/guo22f.html}, abstract = {Storing information in DNA molecules is of great interest because of its advantages in longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarities. Levenshtein distance is the most suitable metric on the similarity between two DNA sequences, but it is inferior in terms of computational complexity and less compatible with mature clustering algorithms. In this work, we propose a novel deep squared Euclidean embedding for DNA sequences using Siamese neural network, squared Euclidean embedding, and chi-squared regression. The Levenshtein distance is approximated by the squared Euclidean distance between the embedding vectors, which is fast calculated and clustering algorithm friendly. The proposed approach is analyzed theoretically and experimentally. The results show that the proposed embedding is efficient and robust.} }
Endnote
%0 Conference Paper %T Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage %A Alan J.X. Guo %A Cong Liang %A Qing-Hu Hou %B Proceedings of the 39th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2022 %E Kamalika Chaudhuri %E Stefanie Jegelka %E Le Song %E Csaba Szepesvari %E Gang Niu %E Sivan Sabato %F pmlr-v162-guo22f %I PMLR %P 8095--8108 %U https://proceedings.mlr.press/v162/guo22f.html %V 162 %X Storing information in DNA molecules is of great interest because of its advantages in longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarities. Levenshtein distance is the most suitable metric on the similarity between two DNA sequences, but it is inferior in terms of computational complexity and less compatible with mature clustering algorithms. In this work, we propose a novel deep squared Euclidean embedding for DNA sequences using Siamese neural network, squared Euclidean embedding, and chi-squared regression. The Levenshtein distance is approximated by the squared Euclidean distance between the embedding vectors, which is fast calculated and clustering algorithm friendly. The proposed approach is analyzed theoretically and experimentally. The results show that the proposed embedding is efficient and robust.
APA
Guo, A.J., Liang, C. & Hou, Q.. (2022). Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:8095-8108 Available from https://proceedings.mlr.press/v162/guo22f.html.

Related Material