CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:72485-72504, 2025.

Abstract

Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-GEN, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-GEN facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the ResidueControlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-GEN enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yin25g, title = {{CFP}-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models}, author = {Yin, Junbo and Zha, Chao and He, Wenjia and Xu, Chencheng and Gao, Xin}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {72485--72504}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yin25g/yin25g.pdf}, url = {https://proceedings.mlr.press/v267/yin25g.html}, abstract = {Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-GEN, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-GEN facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the ResidueControlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-GEN enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins.} }
Endnote
%0 Conference Paper %T CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models %A Junbo Yin %A Chao Zha %A Wenjia He %A Chencheng Xu %A Xin Gao %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-yin25g %I PMLR %P 72485--72504 %U https://proceedings.mlr.press/v267/yin25g.html %V 267 %X Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-GEN, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-GEN facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the ResidueControlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-GEN enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins.
APA
Yin, J., Zha, C., He, W., Xu, C. & Gao, X.. (2025). CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:72485-72504 Available from https://proceedings.mlr.press/v267/yin25g.html.

Related Material