On Modelability and Generalizability: Are Machine Learning Models for Drug Synergy Exploiting Artefacts and Biases in Available Data?

Arushi G. K. Majha, Ian Stott, Andreas Bender
Proceedings of the 18th Machine Learning in Computational Biology meeting, PMLR 240:123-134, 2024.

Abstract

Synergy models are useful tools for exploring drug combinatorial search space and identifying promising sub-spaces for in vitro/vivo experiments. Here, we report that distributional biases in the training-validation-test sets used for predictive modeling of drug synergy can explain much of the variability observed in model performances (up to $0.22$ $\Delta$AUPRC). We built 145 classification models spanning 4,577 unique drugs and 75,276 pair-wise drug combinations extracted from DrugComb, and examined spurious correlations in both the input feature and output label spaces. We posit that some synergy datasets are easier to model than others due to factors such as synergy spread, class separation, chemical structural diversity, physicochemical diversity, combinatorial tests per drug, and combinatorial label entropy. We simulate distribution shifts for these dataset attributes and report that the drug-wise homogeneity of combinatorial labels most influences modelability ($0.16\pm0.06$ $\Delta$AUPRC). Our findings imply that seemingly high-performing drug synergy models may not generalize well to broader medicinal space. We caution that the synergy modeling community’s efforts may be better expended in examining data-specific artefacts and biases rigorously prior to model building.
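One of the dataset attributes the abstract highlights, drug-wise homogeneity of combinatorial labels, can be quantified via the Shannon entropy of each drug's synergy labels across its tested combinations. The sketch below is illustrative only (the function name, tuple layout, and binary-label assumption are not from the paper): low entropy means a drug's combinations are labeled nearly uniformly, which a model can exploit as a shortcut.

```python
import math
from collections import defaultdict

def drugwise_label_entropy(combinations):
    """Shannon entropy of binary synergy labels per drug.

    `combinations` is a list of (drug_a, drug_b, label) tuples with
    label in {0, 1}. Entropy 0 means a drug's combinatorial labels are
    perfectly homogeneous (all synergistic or all non-synergistic);
    entropy 1 means a 50/50 split.
    """
    labels = defaultdict(list)
    for a, b, y in combinations:
        labels[a].append(y)
        labels[b].append(y)

    entropy = {}
    for drug, ys in labels.items():
        p = sum(ys) / len(ys)  # fraction of synergistic labels for this drug
        if p in (0.0, 1.0):
            entropy[drug] = 0.0
        else:
            entropy[drug] = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return entropy

# Drug "A" has mixed labels (entropy 1.0); drug "C" is homogeneous (0.0).
combos = [("A", "B", 1), ("A", "C", 0), ("B", "C", 0)]
print(drugwise_label_entropy(combos))
```

Under this framing, a split whose test drugs all have near-zero entropy will look easy to model regardless of whether the features carry real signal.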

Cite this Paper


BibTeX
@InProceedings{pmlr-v240-majha24a,
  title     = {On Modelability and Generalizability: Are Machine Learning Models for Drug Synergy Exploiting Artefacts and Biases in Available Data?},
  author    = {Majha, Arushi G. K. and Stott, Ian and Bender, Andreas},
  booktitle = {Proceedings of the 18th Machine Learning in Computational Biology meeting},
  pages     = {123--134},
  year      = {2024},
  editor    = {Knowles, David A. and Mostafavi, Sara},
  volume    = {240},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Nov--01 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v240/majha24a/majha24a.pdf},
  url       = {https://proceedings.mlr.press/v240/majha24a.html},
  abstract  = {Synergy models are useful tools for exploring drug combinatorial search space and identifying promising sub-spaces for in vitro/vivo experiments. Here, we report that distributional biases in the training-validation-test sets used for predictive modeling of drug synergy can explain much of the variability observed in model performances (up to $0.22$ $\Delta$AUPRC). We built 145 classification models spanning 4,577 unique drugs and 75,276 pair-wise drug combinations extracted from DrugComb, and examined spurious correlations in both the input feature and output label spaces. We posit that some synergy datasets are easier to model than others due to factors such as synergy spread, class separation, chemical structural diversity, physicochemical diversity, combinatorial tests per drug, and combinatorial label entropy. We simulate distribution shifts for these dataset attributes and report that the drug-wise homogeneity of combinatorial labels most influences modelability ($0.16\pm0.06$ $\Delta$AUPRC). Our findings imply that seemingly high-performing drug synergy models may not generalize well to broader medicinal space. We caution that the synergy modeling community’s efforts may be better expended in examining data-specific artefacts and biases rigorously prior to model building.}
}
Endnote
%0 Conference Paper
%T On Modelability and Generalizability: Are Machine Learning Models for Drug Synergy Exploiting Artefacts and Biases in Available Data?
%A Arushi G. K. Majha
%A Ian Stott
%A Andreas Bender
%B Proceedings of the 18th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2024
%E David A. Knowles
%E Sara Mostafavi
%F pmlr-v240-majha24a
%I PMLR
%P 123--134
%U https://proceedings.mlr.press/v240/majha24a.html
%V 240
%X Synergy models are useful tools for exploring drug combinatorial search space and identifying promising sub-spaces for in vitro/vivo experiments. Here, we report that distributional biases in the training-validation-test sets used for predictive modeling of drug synergy can explain much of the variability observed in model performances (up to $0.22$ $\Delta$AUPRC). We built 145 classification models spanning 4,577 unique drugs and 75,276 pair-wise drug combinations extracted from DrugComb, and examined spurious correlations in both the input feature and output label spaces. We posit that some synergy datasets are easier to model than others due to factors such as synergy spread, class separation, chemical structural diversity, physicochemical diversity, combinatorial tests per drug, and combinatorial label entropy. We simulate distribution shifts for these dataset attributes and report that the drug-wise homogeneity of combinatorial labels most influences modelability ($0.16\pm0.06$ $\Delta$AUPRC). Our findings imply that seemingly high-performing drug synergy models may not generalize well to broader medicinal space. We caution that the synergy modeling community’s efforts may be better expended in examining data-specific artefacts and biases rigorously prior to model building.
APA
Majha, A. G. K., Stott, I., & Bender, A. (2024). On Modelability and Generalizability: Are Machine Learning Models for Drug Synergy Exploiting Artefacts and Biases in Available Data? Proceedings of the 18th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 240:123-134. Available from https://proceedings.mlr.press/v240/majha24a.html.

Related Material