A Machine Learning Framework for Predicting Natural Product-Protein Interactions

Jiabo Li, Shijie Gai, Wenfeng Shen, Zhou Lei
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:36-45, 2025.

Abstract

Natural products (NPs) are valuable resources for drug development, but accurately predicting their interactions with protein targets remains challenging due to the limitations of existing methods, which primarily rely on either ligand-based approaches or hybrid feature-based methods that require protein pocket data. To address these limitations, we developed a Y-shaped machine learning framework that integrates NP structural data with protein sequence information. We constructed a comprehensive NP-protein interaction dataset and extracted features from NPs, including Atom Sequence Path (ASP), PubChem, and Extended Connectivity Fingerprints (ECFP), as well as protein features such as Amino Acid Composition (AAC), Conjoint Triad (CTriad), and Dipeptide Composition (DPC). Six machine learning models—Random Forest (RF), AdaBoost, XGBoost, K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), and Logistic Regression (LR)—were trained and evaluated. Experimental results demonstrated that NP-derived PubChem features and protein-derived DPC features were the most effective, with XGBoost achieving the best performance among all models. Our study provides an efficient and generalizable framework for NP-protein interaction prediction, significantly advancing the potential for drug discovery.
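The protein-side descriptors named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions of AAC (frequencies of the 20 residues) and DPC (frequencies of the 400 ordered residue pairs), and it assumes the two branches of the Y-shaped framework are joined by plain concatenation. The function names, the example NP bit vector, and the concatenation step are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """Amino Acid Composition: fraction of each standard residue (20 values).

    Assumes a non-empty sequence of standard residues.
    """
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide Composition: fraction of each ordered residue pair (400 values).

    Assumes at least two residues; pairs containing a non-standard
    residue are not counted but still contribute to the denominator.
    """
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping pairs in the sequence
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in DIPEPTIDES]


def join_features(np_bits, protein_vec):
    """Concatenate an NP fingerprint with a protein descriptor (assumed fusion step)."""
    return list(np_bits) + list(protein_vec)


if __name__ == "__main__":
    toy_fingerprint = [1, 0, 1]  # hypothetical stand-in for a PubChem or ECFP bit vector
    features = join_features(toy_fingerprint, dpc("ACCA"))
    print(len(features))  # 3 fingerprint bits + 400 DPC values
```

A vector built this way could be passed to any of the listed classifiers; which fingerprint length and which fusion scheme the paper actually uses is not stated in the abstract.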

Cite this Paper


BibTeX
@InProceedings{pmlr-v278-li25c,
  title = {A Machine Learning Framework for Predicting Natural Product-Protein Interactions},
  author = {Li, Jiabo and Gai, Shijie and Shen, Wenfeng and Lei, Zhou},
  booktitle = {Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing},
  pages = {36--45},
  year = {2025},
  editor = {Zeng, Nianyin and Pachori, Ram Bilas and Wang, Dongshu},
  volume = {278},
  series = {Proceedings of Machine Learning Research},
  month = {25--27 Apr},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v278/main/assets/li25c/li25c.pdf},
  url = {https://proceedings.mlr.press/v278/li25c.html},
  abstract = {Natural products (NPs) are valuable resources for drug development, but accurately predicting their interactions with protein targets remains challenging due to the limitations of existing methods, which primarily rely on either ligand-based approaches or hybrid feature-based methods that require protein pocket data. To address these limitations, we developed a Y-shaped machine learning framework that integrates NP structural data with protein sequence information. We constructed a comprehensive NP-protein interaction dataset and extracted features from NPs, including Atom Sequence Path (ASP), PubChem, and Extended Connectivity Fingerprints (ECFP), as well as protein features such as Amino Acid Composition (AAC), Conjoint Triad (CTriad), and Dipeptide Composition (DPC). Six machine learning models—Random Forest (RF), AdaBoost, XGBoost, K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), and Logistic Regression (LR)—were trained and evaluated. Experimental results demonstrated that NP-derived PubChem features and protein-derived DPC features were the most effective, with XGBoost achieving the best performance among all models. Our study provides an efficient and generalizable framework for NP-protein interaction prediction, significantly advancing the potential for drug discovery.}
}
Endnote
%0 Conference Paper
%T A Machine Learning Framework for Predicting Natural Product-Protein Interactions
%A Jiabo Li
%A Shijie Gai
%A Wenfeng Shen
%A Zhou Lei
%B Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing
%C Proceedings of Machine Learning Research
%D 2025
%E Nianyin Zeng
%E Ram Bilas Pachori
%E Dongshu Wang
%F pmlr-v278-li25c
%I PMLR
%P 36--45
%U https://proceedings.mlr.press/v278/li25c.html
%V 278
%X Natural products (NPs) are valuable resources for drug development, but accurately predicting their interactions with protein targets remains challenging due to the limitations of existing methods, which primarily rely on either ligand-based approaches or hybrid feature-based methods that require protein pocket data. To address these limitations, we developed a Y-shaped machine learning framework that integrates NP structural data with protein sequence information. We constructed a comprehensive NP-protein interaction dataset and extracted features from NPs, including Atom Sequence Path (ASP), PubChem, and Extended Connectivity Fingerprints (ECFP), as well as protein features such as Amino Acid Composition (AAC), Conjoint Triad (CTriad), and Dipeptide Composition (DPC). Six machine learning models—Random Forest (RF), AdaBoost, XGBoost, K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), and Logistic Regression (LR)—were trained and evaluated. Experimental results demonstrated that NP-derived PubChem features and protein-derived DPC features were the most effective, with XGBoost achieving the best performance among all models. Our study provides an efficient and generalizable framework for NP-protein interaction prediction, significantly advancing the potential for drug discovery.
APA
Li, J., Gai, S., Shen, W. & Lei, Z. (2025). A Machine Learning Framework for Predicting Natural Product-Protein Interactions. Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, in Proceedings of Machine Learning Research 278:36-45. Available from https://proceedings.mlr.press/v278/li25c.html.
