A Genomic Language Model for Zero-Shot Prediction of Promoter Indel Effects

Courtney Shearer, Rose Orenbuch, Felix Teufel, Christian Steinmetz, Daniel Ritter, Erik Xie, Artem Gazizov, Aviv Spinner, Mafalda Dias, Jonathan Frazer, Pascal Notin, Debora Marks
Proceedings of the 20th Machine Learning in Computational Biology meeting, PMLR 311:95-127, 2025.

Abstract

Disease-associated genetic variants occur extensively across the human genome, predominantly in noncoding regions like promoters. While crucial for understanding disease mechanisms, current methods struggle to predict effects of insertions and deletions (indels) that can disrupt gene expression. We present LOL-EVE (Language Of Life for Evolutionary Variant Effects), a conditional autoregressive transformer trained on 13.6 million mammalian promoter sequences. By leveraging evolutionary patterns and genetic context, LOL-EVE enables zero-shot prediction of indel effects in human promoters. We introduce three new benchmarks for promoter indel prediction: ultra rare variant prioritization, causal eQTL identification, and transcription factor binding site disruption analysis. LOL-EVE’s superior performance across these tasks suggests the potential of region-specific autoregressive genomic language models for identifying disease-causing non-coding variants.

Cite this Paper


BibTeX
@InProceedings{pmlr-v311-shearer25a,
  title = {A Genomic Language Model for Zero-Shot Prediction of Promoter Indel Effects},
  author = {Shearer, Courtney and Orenbuch, Rose and Teufel, Felix and Steinmetz, Christian and Ritter, Daniel and Xie, Erik and Gazizov, Artem and Spinner, Aviv and Dias, Mafalda and Frazer, Jonathan and Notin, Pascal and Marks, Debora},
  booktitle = {Proceedings of the 20th Machine Learning in Computational Biology meeting},
  pages = {95--127},
  year = {2025},
  editor = {Knowles, David A and Koo, Peter K},
  volume = {311},
  series = {Proceedings of Machine Learning Research},
  month = {10--11 Sep},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v311/main/assets/shearer25a/shearer25a.pdf},
  url = {https://proceedings.mlr.press/v311/shearer25a.html},
  abstract = {Disease-associated genetic variants occur extensively across the human genome, predominantly in noncoding regions like promoters. While crucial for understanding disease mechanisms, current methods struggle to predict effects of insertions and deletions (indels) that can disrupt gene expression. We present LOL-EVE (Language Of Life for Evolutionary Variant Effects), a conditional autoregressive transformer trained on 13.6 million mammalian promoter sequences. By leveraging evolutionary patterns and genetic context, LOL-EVE enables zero-shot prediction of indel effects in human promoters. We introduce three new benchmarks for promoter indel prediction: ultra rare variant prioritization, causal eQTL identification, and transcription factor binding site disruption analysis. LOL-EVE’s superior performance across these tasks suggests the potential of region-specific autoregressive genomic language models for identifying disease-causing non-coding variants.}
}
Endnote
%0 Conference Paper
%T A Genomic Language Model for Zero-Shot Prediction of Promoter Indel Effects
%A Courtney Shearer
%A Rose Orenbuch
%A Felix Teufel
%A Christian Steinmetz
%A Daniel Ritter
%A Erik Xie
%A Artem Gazizov
%A Aviv Spinner
%A Mafalda Dias
%A Jonathan Frazer
%A Pascal Notin
%A Debora Marks
%B Proceedings of the 20th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2025
%E David A Knowles
%E Peter K Koo
%F pmlr-v311-shearer25a
%I PMLR
%P 95--127
%U https://proceedings.mlr.press/v311/shearer25a.html
%V 311
%X Disease-associated genetic variants occur extensively across the human genome, predominantly in noncoding regions like promoters. While crucial for understanding disease mechanisms, current methods struggle to predict effects of insertions and deletions (indels) that can disrupt gene expression. We present LOL-EVE (Language Of Life for Evolutionary Variant Effects), a conditional autoregressive transformer trained on 13.6 million mammalian promoter sequences. By leveraging evolutionary patterns and genetic context, LOL-EVE enables zero-shot prediction of indel effects in human promoters. We introduce three new benchmarks for promoter indel prediction: ultra rare variant prioritization, causal eQTL identification, and transcription factor binding site disruption analysis. LOL-EVE’s superior performance across these tasks suggests the potential of region-specific autoregressive genomic language models for identifying disease-causing non-coding variants.
APA
Shearer, C., Orenbuch, R., Teufel, F., Steinmetz, C., Ritter, D., Xie, E., Gazizov, A., Spinner, A., Dias, M., Frazer, J., Notin, P. & Marks, D. (2025). A Genomic Language Model for Zero-Shot Prediction of Promoter Indel Effects. Proceedings of the 20th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 311:95-127. Available from https://proceedings.mlr.press/v311/shearer25a.html.

Related Material