Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction

Ruben Weitzman, Peter Mørch Groth, Lood Van Niekerk, Aoi Otani, Yarin Gal, Debora Susan Marks, Pascal Notin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:66397-66418, 2025.

Abstract

Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training mod- els on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertions & deletions patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time – offering a scalable alternative to alignment-centric approaches.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-weitzman25a, title = {Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction}, author = {Weitzman, Ruben and M{\o}rch Groth, Peter and Van Niekerk, Lood and Otani, Aoi and Gal, Yarin and Marks, Debora Susan and Notin, Pascal}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {66397--66418}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/weitzman25a/weitzman25a.pdf}, url = {https://proceedings.mlr.press/v267/weitzman25a.html}, abstract = {Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training mod- els on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertions & deletions patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time – offering a scalable alternative to alignment-centric approaches.} }
Endnote
%0 Conference Paper %T Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction %A Ruben Weitzman %A Peter Mørch Groth %A Lood Van Niekerk %A Aoi Otani %A Yarin Gal %A Debora Susan Marks %A Pascal Notin %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-weitzman25a %I PMLR %P 66397--66418 %U https://proceedings.mlr.press/v267/weitzman25a.html %V 267 %X Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training mod- els on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertions & deletions patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time – offering a scalable alternative to alignment-centric approaches.
APA
Weitzman, R., Mørch Groth, P., Van Niekerk, L., Otani, A., Gal, Y., Marks, D.S. & Notin, P.. (2025). Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:66397-66418 Available from https://proceedings.mlr.press/v267/weitzman25a.html.

Related Material