Jointly Modelling Uncertainty and Diversity for Active Molecular Property Prediction

Kuangqi Zhou, Kaixin Wang, Jian Tang, Jiashi Feng, Bryan Hooi, Peilin Zhao, Tingyang Xu, Xinchao Wang
Proceedings of the First Learning on Graphs Conference, PMLR 198:29:1-29:21, 2022.

Abstract

Molecular property prediction is a fundamental task in AI-driven drug discovery. Deep learning has achieved great success in this task, but relies heavily on abundant annotated data. However, annotating molecules is particularly costly because it often requires lab experiments conducted by experts. Active Learning (AL) tackles this issue by querying (i.e., selecting) the most valuable samples to annotate, according to two criteria: uncertainty of the model and diversity of data. Combining both criteria (a.k.a. hybrid AL) generally leads to better performance than using a single criterion. However, the best existing hybrid methods rely on trade-off hyperparameters to balance uncertainty and diversity, and these hyperparameters must be carefully tuned in each experimental setting, which wastes both annotation budget and time. In this paper, we propose a novel AL method that jointly models uncertainty and diversity without such trade-off hyperparameters. Specifically, we model the joint distribution of the labeled data and the model prediction. Based on this distribution, we introduce a Minimum Maximum Probability Querying (MMPQ) strategy, in which a single selection score naturally captures how uncertain the model is about its prediction and how dissimilar the sample is to the currently labeled data. To model the joint distribution, we adapt energy-based models to non-Euclidean molecular graph data by learning chemically meaningful embedding vectors as proxies for the graphs. We perform extensive experiments on binary classification datasets. Results show that our method achieves superior AL performance, outperforming existing methods by a large margin. We also conduct ablation studies to verify different design choices of our approach.
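The abstract does not give the MMPQ formula, so the following is only an illustrative sketch of how a single score could capture both criteria. It assumes the score of a sample x is the maximum over classes of the joint log-probability log p(x, y) = log p(x) + log p(y | x), and that the samples with the smallest such score are queried. The function name `mmpq_select` and the toy numbers are hypothetical and not taken from the paper.

```python
import math


def mmpq_select(joint_log_probs, k):
    """Return indices of the k samples whose max_y log p(x, y) is smallest.

    joint_log_probs: one row per unlabeled sample, one entry per class.
    Assumed reading of "minimum maximum probability", not the paper's code.
    """
    # Score each sample by its most probable (sample, label) pair.
    scores = [max(row) for row in joint_log_probs]
    # A low score means low density p(x), an uncertain p(y | x), or both.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:k]


# Toy pool of three unlabeled samples (made-up values).
density = [0.5, 0.5, 0.01]                           # p(x): similarity to labeled data
class_probs = [(0.95, 0.05), (0.5, 0.5), (0.95, 0.05)]  # p(y | x)

joint = [
    [math.log(d) + math.log(p) for p in probs]
    for d, probs in zip(density, class_probs)
]

picked = mmpq_select(joint, k=2)
print(picked)
```

In this toy pool, sample 2 is far from the labeled data (low density) and sample 1 has an uncertain prediction, so both rank ahead of sample 0 without any weighting hyperparameter between the two criteria.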

Cite this Paper


BibTeX
@InProceedings{pmlr-v198-zhou22b,
  title = {Jointly Modelling Uncertainty and Diversity for Active Molecular Property Prediction},
  author = {Zhou, Kuangqi and Wang, Kaixin and Tang, Jian and Feng, Jiashi and Hooi, Bryan and Zhao, Peilin and Xu, Tingyang and Wang, Xinchao},
  booktitle = {Proceedings of the First Learning on Graphs Conference},
  pages = {29:1--29:21},
  year = {2022},
  editor = {Rieck, Bastian and Pascanu, Razvan},
  volume = {198},
  series = {Proceedings of Machine Learning Research},
  month = {09--12 Dec},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v198/zhou22b/zhou22b.pdf},
  url = {https://proceedings.mlr.press/v198/zhou22b.html},
  abstract = {Molecular property prediction is a fundamental task in AI-driven drug discovery. Deep learning has achieved great success in this task, but relies heavily on abundant annotated data. However, annotating molecules is particularly costly because it often requires lab experiments conducted by experts. Active Learning (AL) tackles this issue by querying (i.e., selecting) the most valuable samples to annotate, according to two criteria: uncertainty of the model and diversity of data. Combining both criteria (a.k.a. hybrid AL) generally leads to better performance than using only one single criterion. However, existing best hybrid methods rely on some trade-off hyperparameters for balancing uncertainty and diversity, and hence need to carefully tune the hyperparameters in each experiment setting, causing great annotation and time inefficiency. In this paper, we propose a novel AL method that jointly models uncertainty and diversity without the trade-off hyperparameters. Specifically, we model the joint distribution of the labeled data and the model prediction. Based on this distribution, we introduce a Minimum Maximum Probability Querying (MMPQ) strategy, in which a single selection score naturally captures how the model is uncertain about its prediction, and how dissimilar the sample is to the currently labeled data. To model the joint distribution, we adapt the energy-based models to the non-Euclidean molecular graph data, by learning chemically-meaningful embedding vectors as the proxy of the graphs. We perform extensive experiments on binary classification datasets. Results show that our method achieves superior AL performance, outperforming existing methods by a large margin. We also conduct ablation studies to verify different design choices of our approach.}
}
Endnote
%0 Conference Paper
%T Jointly Modelling Uncertainty and Diversity for Active Molecular Property Prediction
%A Kuangqi Zhou
%A Kaixin Wang
%A Jian Tang
%A Jiashi Feng
%A Bryan Hooi
%A Peilin Zhao
%A Tingyang Xu
%A Xinchao Wang
%B Proceedings of the First Learning on Graphs Conference
%C Proceedings of Machine Learning Research
%D 2022
%E Bastian Rieck
%E Razvan Pascanu
%F pmlr-v198-zhou22b
%I PMLR
%P 29:1--29:21
%U https://proceedings.mlr.press/v198/zhou22b.html
%V 198
%X Molecular property prediction is a fundamental task in AI-driven drug discovery. Deep learning has achieved great success in this task, but relies heavily on abundant annotated data. However, annotating molecules is particularly costly because it often requires lab experiments conducted by experts. Active Learning (AL) tackles this issue by querying (i.e., selecting) the most valuable samples to annotate, according to two criteria: uncertainty of the model and diversity of data. Combining both criteria (a.k.a. hybrid AL) generally leads to better performance than using only one single criterion. However, existing best hybrid methods rely on some trade-off hyperparameters for balancing uncertainty and diversity, and hence need to carefully tune the hyperparameters in each experiment setting, causing great annotation and time inefficiency. In this paper, we propose a novel AL method that jointly models uncertainty and diversity without the trade-off hyperparameters. Specifically, we model the joint distribution of the labeled data and the model prediction. Based on this distribution, we introduce a Minimum Maximum Probability Querying (MMPQ) strategy, in which a single selection score naturally captures how the model is uncertain about its prediction, and how dissimilar the sample is to the currently labeled data. To model the joint distribution, we adapt the energy-based models to the non-Euclidean molecular graph data, by learning chemically-meaningful embedding vectors as the proxy of the graphs. We perform extensive experiments on binary classification datasets. Results show that our method achieves superior AL performance, outperforming existing methods by a large margin. We also conduct ablation studies to verify different design choices of our approach.
APA
Zhou, K., Wang, K., Tang, J., Feng, J., Hooi, B., Zhao, P., Xu, T., & Wang, X. (2022). Jointly Modelling Uncertainty and Diversity for Active Molecular Property Prediction. Proceedings of the First Learning on Graphs Conference, in Proceedings of Machine Learning Research 198:29:1-29:21. Available from https://proceedings.mlr.press/v198/zhou22b.html.

Related Material