[edit]
Jointly Modelling Uncertainty and Diversity for Active Molecular Property Prediction
Proceedings of the First Learning on Graphs Conference, PMLR 198:29:1-29:21, 2022.
Abstract
Molecular property prediction is a fundamental task in AI-driven drug discovery. Deep learning has achieved great success in this task, but relies heavily on abundant annotated data. However, annotating molecules is particularly costly because it often requires lab experiments conducted by experts. Active Learning (AL) tackles this issue by querying (i.e., selecting) the most valuable samples to annotate, according to two criteria: uncertainty of the model and diversity of data. Combining both criteria (a.k.a. hybrid AL) generally leads to better performance than using only one single criterion. However, existing best hybrid methods rely on some trade-off hyperparameters for balancing uncertainty and diversity, and hence need to carefully tune the hyperparameters in each experiment setting, causing great annotation and time inefficiency. In this paper, we propose a novel AL method that jointly models uncertainty and diversity without the trade-off hyperparameters. Specifically, we model the joint distribution of the labeled data and the model prediction. Based on this distribution, we introduce a Minimum Maximum Probability Querying (MMPQ) strategy, in which a single selection score naturally captures how the model is uncertain about its prediction, and how dissimilar the sample is to the currently labeled data. To model the joint distribution, we adapt the energy-based models to the non-Euclidean molecular graph data, by learning chemically-meaningful embedding vectors as the proxy of the graphs. We perform extensive experiments on binary classification datasets. Results show that our method achieves superior AL performance, outperforming existing methods by a large margin. We also conduct ablation studies to verify different design choices of our approach.