MSA Transformer

Roshan M Rao; Jason Liu; Robert Verkuil; Joshua Meier; John Canny; Pieter Abbeel; Tom Sercu; Alexander Rives

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8844-8856, 2021.

Abstract

Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
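As a rough illustration of what "interleaves row and column attention" can mean for an alignment laid out as rows (sequences) by columns (aligned positions), here is a minimal sketch in plain Python. It is a generic axial-attention toy, not the authors' implementation: the function names (`attend`, `row_attention`, `column_attention`, `axial_block`), the single attention head, and the identity query/key/value projections are all assumptions made to keep the example small. It omits learned weights, residual connections, normalization, feed-forward layers, and the masked language modeling objective.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(vectors):
    # Single-head self-attention over a list of equal-length vectors.
    # Assumption: queries, keys and values are the inputs themselves
    # (identity projections), so there are no learned parameters here.
    scale = math.sqrt(len(vectors[0]))
    out = []
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        out.append([
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(len(query))
        ])
    return out


def row_attention(msa):
    # Attend within each sequence, i.e. across the positions of one row.
    return [attend(row) for row in msa]


def column_attention(msa):
    # Attend within each alignment column, i.e. across sequences at one position.
    n_rows, n_cols = len(msa), len(msa[0])
    cols = [attend([msa[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    # Transpose back to the rows-by-columns layout.
    return [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]


def axial_block(msa):
    # One interleaved step: row attention followed by column attention.
    return column_attention(row_attention(msa))
```

Here `msa` is a nested list indexed as `msa[sequence][position][feature]`; calling `axial_block(msa)` returns a structure of the same shape in which every embedding has mixed information first along its own sequence and then across the sequences aligned at its position.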

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-rao21a,
  title     = {MSA Transformer},
  author    = {Rao, Roshan M and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John and Abbeel, Pieter and Sercu, Tom and Rives, Alexander},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {8844--8856},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/rao21a/rao21a.pdf},
  url       = {https://proceedings.mlr.press/v139/rao21a.html},
  abstract  = {Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.}
}

Endnote

%0 Conference Paper
%T MSA Transformer
%A Roshan M Rao
%A Jason Liu
%A Robert Verkuil
%A Joshua Meier
%A John Canny
%A Pieter Abbeel
%A Tom Sercu
%A Alexander Rives
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-rao21a
%I PMLR
%P 8844--8856
%U https://proceedings.mlr.press/v139/rao21a.html
%V 139
%X Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.

APA

Rao, R.M., Liu, J., Verkuil, R., Meier, J., Canny, J., Abbeel, P., Sercu, T. & Rives, A. (2021). MSA Transformer. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:8844-8856. Available from https://proceedings.mlr.press/v139/rao21a.html.
