Characterizing uncertainty in predictions of genomic sequence-to-activity models

Ayesha Bajwa, Ruchir Rastogi, Pooja Kathail, Richard W. Shuai, Nilah Ioannidis
Proceedings of the 18th Machine Learning in Computational Biology meeting, PMLR 240:279-297, 2024.

Abstract

Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.

Cite this Paper


BibTeX
@InProceedings{pmlr-v240-bajwa24a,
  title = {Characterizing uncertainty in predictions of genomic sequence-to-activity models},
  author = {Bajwa, Ayesha and Rastogi, Ruchir and Kathail, Pooja and Shuai, Richard W. and Ioannidis, Nilah},
  booktitle = {Proceedings of the 18th Machine Learning in Computational Biology meeting},
  pages = {279--297},
  year = {2024},
  editor = {Knowles, David A. and Mostafavi, Sara},
  volume = {240},
  series = {Proceedings of Machine Learning Research},
  month = {30 Nov--01 Dec},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v240/bajwa24a/bajwa24a.pdf},
  url = {https://proceedings.mlr.press/v240/bajwa24a.html},
  abstract = {Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.}
}
Endnote
%0 Conference Paper
%T Characterizing uncertainty in predictions of genomic sequence-to-activity models
%A Ayesha Bajwa
%A Ruchir Rastogi
%A Pooja Kathail
%A Richard W. Shuai
%A Nilah Ioannidis
%B Proceedings of the 18th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2024
%E David A. Knowles
%E Sara Mostafavi
%F pmlr-v240-bajwa24a
%I PMLR
%P 279--297
%U https://proceedings.mlr.press/v240/bajwa24a.html
%V 240
%X Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.
APA
Bajwa, A., Rastogi, R., Kathail, P., Shuai, R. W. & Ioannidis, N. (2024). Characterizing uncertainty in predictions of genomic sequence-to-activity models. Proceedings of the 18th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 240:279-297. Available from https://proceedings.mlr.press/v240/bajwa24a.html.
