Evolution-Inspired Loss Functions for Protein Representation Learning

Chengyue Gong, Adam Klivans, James Madigan Loy, Tianlong Chen, Qiang Liu, Daniel Jesus Diaz
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:15893-15906, 2024.

Abstract

AI-based frameworks for protein engineering use self-supervised learning (SSL) to obtain representations for downstream mutation effect predictions. The most common training objective for these methods is wildtype accuracy: given a sequence or structure where a wildtype residue has been masked, predict the missing amino acid. Wildtype accuracy, however, does not align with the primary goal of protein engineering, which is to suggest a mutation rather than to identify what already appears in nature. Here we present Evolutionary Ranking (EvoRank), a training objective that incorporates evolutionary information derived from multiple sequence alignments (MSAs) to learn more diverse protein representations. EvoRank corresponds to ranking amino-acid likelihoods in the probability distribution induced by an MSA. This objective forces models to learn the underlying evolutionary dynamics of a protein. Across a variety of phenotypes and datasets, we demonstrate that EvoRank leads to dramatic improvements in zero-shot performance and can compete with models fine-tuned on experimental data. This is particularly important in protein engineering, where it is expensive to obtain data for fine-tuning.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-gong24e, title = {Evolution-Inspired Loss Functions for Protein Representation Learning}, author = {Gong, Chengyue and Klivans, Adam and Loy, James Madigan and Chen, Tianlong and Liu, Qiang and Diaz, Daniel Jesus}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {15893--15906}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/gong24e/gong24e.pdf}, url = {https://proceedings.mlr.press/v235/gong24e.html}, abstract = {AI-based frameworks for protein engineering use self-supervised learning (SSL) to obtain representations for downstream mutation effect predictions. The most common training objective for these methods is wildtype accuracy: given a sequence or structure where a wildtype residue has been masked, predict the missing amino acid. Wildtype accuracy, however, does not align with the primary goal of protein engineering, which is to suggest a mutation rather than to identify what already appears in nature. Here we present Evolutionary Ranking (EvoRank), a training objective that incorporates evolutionary information derived from multiple sequence alignments (MSAs) to learn more diverse protein representations. EvoRank corresponds to ranking amino-acid likelihoods in the probability distribution induced by an MSA. This objective forces models to learn the underlying evolutionary dynamics of a protein. Across a variety of phenotypes and datasets, we demonstrate that EvoRank leads to dramatic improvements in zero-shot performance and can compete with models fine-tuned on experimental data. This is particularly important in protein engineering, where it is expensive to obtain data for fine-tuning.} }
Endnote
%0 Conference Paper %T Evolution-Inspired Loss Functions for Protein Representation Learning %A Chengyue Gong %A Adam Klivans %A James Madigan Loy %A Tianlong Chen %A Qiang Liu %A Daniel Jesus Diaz %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-gong24e %I PMLR %P 15893--15906 %U https://proceedings.mlr.press/v235/gong24e.html %V 235 %X AI-based frameworks for protein engineering use self-supervised learning (SSL) to obtain representations for downstream mutation effect predictions. The most common training objective for these methods is wildtype accuracy: given a sequence or structure where a wildtype residue has been masked, predict the missing amino acid. Wildtype accuracy, however, does not align with the primary goal of protein engineering, which is to suggest a mutation rather than to identify what already appears in nature. Here we present Evolutionary Ranking (EvoRank), a training objective that incorporates evolutionary information derived from multiple sequence alignments (MSAs) to learn more diverse protein representations. EvoRank corresponds to ranking amino-acid likelihoods in the probability distribution induced by an MSA. This objective forces models to learn the underlying evolutionary dynamics of a protein. Across a variety of phenotypes and datasets, we demonstrate that EvoRank leads to dramatic improvements in zero-shot performance and can compete with models fine-tuned on experimental data. This is particularly important in protein engineering, where it is expensive to obtain data for fine-tuning.
APA
Gong, C., Klivans, A., Loy, J.M., Chen, T., Liu, Q. & Diaz, D.J.. (2024). Evolution-Inspired Loss Functions for Protein Representation Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:15893-15906 Available from https://proceedings.mlr.press/v235/gong24e.html.

Related Material