Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76795-76810, 2025.
Abstract
Protein language models often struggle to capture biological functions because they lack factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pre-training objectives, but they lack explicit integration of task-oriented knowledge and therefore suffer from limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose the Knowledge-aware retrieval-augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. With a knowledge retriever that learns to predict linkages between the PKG and task proteins, Kara unifies knowledge integration across the pre-training and fine-tuning stages through a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models on 6 representative tasks, achieving an average improvement of 5.1%.
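As a rough illustration of the mechanism the abstract describes (retrieving linked PKG entries for a protein and injecting them as virtual tokens before encoding), below is a minimal PyTorch sketch. All names (`KnowledgeRetriever`, `KaraStyleEncoder`), the dot-product linkage scoring, and the mean-pooled protein summary are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KnowledgeRetriever(nn.Module):
    """Hypothetical retriever: scores PKG node embeddings against a
    protein summary vector and returns the top-k nodes as knowledge."""
    def __init__(self, dim: int, num_kg_nodes: int):
        super().__init__()
        self.kg_embeddings = nn.Embedding(num_kg_nodes, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, protein_repr: torch.Tensor, k: int = 4) -> torch.Tensor:
        # protein_repr: (batch, dim); dot-product linkage scores over all PKG nodes
        scores = self.proj(protein_repr) @ self.kg_embeddings.weight.T  # (batch, N)
        topk = scores.topk(k, dim=-1).indices                           # (batch, k)
        return self.kg_embeddings(topk)                                 # (batch, k, dim)

class KaraStyleEncoder(nn.Module):
    """Prepends retrieved PKG embeddings as virtual tokens to the
    residue sequence, then encodes everything jointly with a Transformer."""
    def __init__(self, vocab_size: int, dim: int, num_kg_nodes: int):
        super().__init__()
        self.residue_embed = nn.Embedding(vocab_size, dim)
        self.retriever = KnowledgeRetriever(dim, num_kg_nodes)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, residue_ids: torch.Tensor, k: int = 4) -> torch.Tensor:
        x = self.residue_embed(residue_ids)                  # (batch, L, dim)
        pooled = x.mean(dim=1)                               # crude protein summary (assumption)
        virtual = self.retriever(pooled, k)                  # (batch, k, dim) virtual tokens
        return self.encoder(torch.cat([virtual, x], dim=1))  # (batch, k+L, dim)

# Toy usage: 2 proteins of length 50 over a 25-symbol residue vocabulary.
model = KaraStyleEncoder(vocab_size=25, dim=64, num_kg_nodes=1000)
out = model(torch.randint(0, 25, (2, 50)))
print(out.shape)  # torch.Size([2, 54, 64]): 4 virtual tokens + 50 residues
```

Concatenating the retrieved embeddings as a prefix lets self-attention condition every residue representation on the graph context, which is one plausible reading of "contextualized virtual tokens"; the paper itself should be consulted for the actual retriever training objective and the structure-based regularization.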