A Machine Learning Framework for Predicting Natural Product-Protein Interactions
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:36-45, 2025.
Abstract
Natural products (NPs) are valuable resources for drug development, but accurately predicting their interactions with protein targets remains challenging because existing methods rely mainly on ligand-based approaches or on hybrid feature-based methods that require protein pocket data. To address these limitations, we developed a Y-shaped machine learning framework that integrates NP structural data with protein sequence information. We constructed a comprehensive NP-protein interaction dataset and extracted NP features, including Atom Sequence Path (ASP), PubChem, and Extended Connectivity Fingerprints (ECFP), as well as protein features, including Amino Acid Composition (AAC), Conjoint Triad (CTriad), and Dipeptide Composition (DPC). Six machine learning models were trained and evaluated: Random Forest (RF), AdaBoost, XGBoost, K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), and Logistic Regression (LR). Experimental results showed that NP-derived PubChem features and protein-derived DPC features were the most effective, and that XGBoost achieved the best performance among the six models. Our study provides an efficient and generalizable framework for NP-protein interaction prediction that requires no protein pocket data, supporting NP-based drug discovery.
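The abstract does not give implementation details, so the following is only a minimal sketch of how the two input branches might be combined. It assumes the standard textbook definitions of AAC (a 20-dimensional vector of residue fractions) and DPC (a 400-dimensional vector of dipeptide frequencies), and it assumes that the "Y-shaped" design joins the two branches by simple concatenation. The function names `aac`, `dpc`, and `pair_features` are hypothetical, and the NP fingerprint is a placeholder bit list; in practice it would come from a cheminformatics toolkit.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino Acid Composition: fraction of each standard residue (20 values).

    Non-standard residue letters are not counted, so the fractions
    sum to less than 1 when such letters are present.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty protein sequence")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide Composition: frequency of each ordered residue pair (400 values)."""
    seq = seq.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]


def pair_features(np_fingerprint, protein_seq):
    """Assumed Y-shaped join: NP branch and protein branch concatenated."""
    return list(np_fingerprint) + dpc(protein_seq)


# Placeholder NP fingerprint (hypothetical bits, not a real PubChem fingerprint).
example = pair_features([1, 0, 1], "ACAC")
```

A vector built this way, one per NP-protein pair, could then be passed with a binary interaction label to any of the six classifiers named in the abstract.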