Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding

Jiasheng Zhang, Delvin Ce Zhang, Shuang Liang, Zhengpin Li, Zhitao Ying, Jie Shao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76795-76810, 2025.

Abstract

Protein language models often struggle to capture biological functions because they lack factual knowledge (e.g., gene descriptions). Existing solutions use protein knowledge graphs (PKGs) as auxiliary pre-training objectives, but they do not integrate task-oriented knowledge explicitly, which limits knowledge exploitation and causes catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose the Knowledge-aware retrieval-augmented protein language model (Kara), which achieves the first task-oriented and explicit integration of PKGs and protein language models. With a knowledge retriever that learns to predict linkages between the PKG and task proteins, Kara unifies knowledge integration across the pre-training and fine-tuning stages through a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models on six representative tasks, with an average improvement of 5.1%.
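The following is a minimal, illustrative sketch (not the authors' released code) of the mechanism the abstract describes: a retriever scores knowledge-graph entities against a new protein's embedding to predict linkages, and the retrieved graph context is prepended to the residue sequence as virtual tokens before encoding. All names, shapes, the dot-product scorer, and the mean-pooling encoder stand-in are hypothetical assumptions; Kara's actual retriever, regularization, and encoder are specified in the paper.

# Minimal sketch, assuming a shared embedding space between the protein
# language model and pretrained PKG node embeddings (hypothetical setup).
import numpy as np

rng = np.random.default_rng(0)

d = 64                    # shared embedding dimension (assumed)
num_pkg_nodes = 1000      # number of PKG entities (assumed)
pkg_embeddings = rng.standard_normal((num_pkg_nodes, d))  # stand-in for pretrained PKG embeddings

def retrieve_pkg_context(protein_vec, k=4):
    """Score every PKG node against the protein embedding and return the
    top-k node embeddings -- a stand-in for 'predicting linkages' between
    the PKG and a new task protein (dot-product scoring is an assumption)."""
    scores = pkg_embeddings @ protein_vec
    return pkg_embeddings[np.argsort(scores)[-k:]]

def encode_with_virtual_tokens(residue_tokens, protein_vec):
    """Prepend retrieved graph context as contextualized virtual tokens,
    then pool. A real model would run a transformer encoder over the
    augmented sequence; mean pooling here is only a placeholder."""
    virtual_tokens = retrieve_pkg_context(protein_vec)
    augmented = np.concatenate([virtual_tokens, residue_tokens], axis=0)
    return augmented.mean(axis=0)

# Toy usage: ten residue-token embeddings for one new protein.
residue_tokens = rng.standard_normal((10, d))
protein_vec = residue_tokens.mean(axis=0)
protein_repr = encode_with_virtual_tokens(residue_tokens, protein_vec)
print(protein_repr.shape)  # (64,)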

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-zhang25cz,
  title     = {Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding},
  author    = {Zhang, Jiasheng and Zhang, Delvin Ce and Liang, Shuang and Li, Zhengpin and Ying, Zhitao and Shao, Jie},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76795--76810},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25cz/zhang25cz.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25cz.html},
  abstract  = {Protein language models often struggle to capture biological functions due to their lack of factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pre-training objectives, but lack explicit integration of task-oriented knowledge, making them suffer from limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose Knowledge-aware retrieval augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. With a knowledge retriever learning to predict linkages between PKG and task proteins, Kara unifies the knowledge integration of the pre-training and fine-tuning stages with a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models in 6 representative tasks, achieving on average 5.1% improvements.}
}
EndNote
%0 Conference Paper
%T Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding
%A Jiasheng Zhang
%A Delvin Ce Zhang
%A Shuang Liang
%A Zhengpin Li
%A Zhitao Ying
%A Jie Shao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25cz
%I PMLR
%P 76795--76810
%U https://proceedings.mlr.press/v267/zhang25cz.html
%V 267
%X Protein language models often struggle to capture biological functions due to their lack of factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pre-training objectives, but lack explicit integration of task-oriented knowledge, making them suffer from limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose Knowledge-aware retrieval augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. With a knowledge retriever learning to predict linkages between PKG and task proteins, Kara unifies the knowledge integration of the pre-training and fine-tuning stages with a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models in 6 representative tasks, achieving on average 5.1% improvements.
APA
Zhang, J., Zhang, D.C., Liang, S., Li, Z., Ying, Z. & Shao, J. (2025). Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76795-76810. Available from https://proceedings.mlr.press/v267/zhang25cz.html.
