Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

Zhao Jiale; Wanru Zhuang; Jia Song; Yaqi Li; Shuqi Lu

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:21791-21805, 2024.

Abstract

In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models primarily focus on the residue level, i.e., alpha carbon atoms, while ignoring other atoms such as side chain atoms. We argue that modeling proteins at both the residue and atom levels is important, since side chain atoms can also be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify a key reason as the information leakage caused by the inclusion of atom structure in the input, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representations suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate that our proposed pre-training approach significantly outperforms other methods. Our code will be made public.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-jiale24a,
  title = 	 {Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3{D} Protein Chains},
  author =       {Jiale, Zhao and Zhuang, Wanru and Song, Jia and Li, Yaqi and Lu, Shuqi},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {21791--21805},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/jiale24a/jiale24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/jiale24a.html},
  abstract = 	 {In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models primarily focus on the residue level, i.e., alpha carbon atoms, while ignoring other atoms like side chain atoms. We argue that modeling proteins at both residue and atom levels is important since the side chain atoms can also be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify a key reason is the information leakage caused by the inclusion of atom structure in the input, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representation suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate our proposed pre-training approach significantly outperforms other methods. Our code will be made public.}
}

Endnote

%0 Conference Paper
%T Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains
%A Zhao Jiale
%A Wanru Zhuang
%A Jia Song
%A Yaqi Li
%A Shuqi Lu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-jiale24a
%I PMLR
%P 21791--21805
%U https://proceedings.mlr.press/v235/jiale24a.html
%V 235
%X In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models primarily focus on the residue level, i.e., alpha carbon atoms, while ignoring other atoms like side chain atoms. We argue that modeling proteins at both residue and atom levels is important since the side chain atoms can also be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify a key reason is the information leakage caused by the inclusion of atom structure in the input, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representation suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate our proposed pre-training approach significantly outperforms other methods. Our code will be made public.

APA


Jiale, Z., Zhuang, W., Song, J., Li, Y. & Lu, S. (2024). Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:21791-21805. Available from https://proceedings.mlr.press/v235/jiale24a.html.
