Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Yair Schiff, Chia Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:43632-43648, 2024.

Abstract

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here: https://github.com/kuleshov-group/caduceus.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-schiff24a, title = {Caduceus: Bi-Directional Equivariant Long-Range {DNA} Sequence Modeling}, author = {Schiff, Yair and Kao, Chia Hsiang and Gokaslan, Aaron and Dao, Tri and Gu, Albert and Kuleshov, Volodymyr}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {43632--43648}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/schiff24a/schiff24a.pdf}, url = {https://proceedings.mlr.press/v235/schiff24a.html}, abstract = {Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here: https://github.com/kuleshov-group/caduceus.} }
Endnote
%0 Conference Paper %T Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling %A Yair Schiff %A Chia Hsiang Kao %A Aaron Gokaslan %A Tri Dao %A Albert Gu %A Volodymyr Kuleshov %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-schiff24a %I PMLR %P 43632--43648 %U https://proceedings.mlr.press/v235/schiff24a.html %V 235 %X Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here: https://github.com/kuleshov-group/caduceus.
APA
Schiff, Y., Kao, C.H., Gokaslan, A., Dao, T., Gu, A. & Kuleshov, V.. (2024). Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:43632-43648 Available from https://proceedings.mlr.press/v235/schiff24a.html.

Related Material