A Genomic Language Model for Zero-Shot Prediction of Promoter Indel Effects
Proceedings of the 20th Machine Learning in Computational Biology meeting, PMLR 311:95-127, 2025.
Abstract
Disease-associated genetic variants occur extensively across the human genome, predominantly in noncoding regions such as promoters. Although insertions and deletions (indels) in these regions can disrupt gene expression and are crucial for understanding disease mechanisms, current methods struggle to predict their effects. We present LOL-EVE (Language Of Life for Evolutionary Variant Effects), a conditional autoregressive transformer trained on 13.6 million mammalian promoter sequences. By leveraging evolutionary patterns and genetic context, LOL-EVE enables zero-shot prediction of indel effects in human promoters. We introduce three new benchmarks for promoter indel prediction: ultra-rare variant prioritization, causal eQTL identification, and transcription factor binding site disruption analysis. LOL-EVE’s superior performance across these tasks suggests the potential of region-specific autoregressive genomic language models for identifying disease-causing noncoding variants.
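As an illustration of what zero-shot indel scoring with an autoregressive sequence model can look like, the sketch below compares the log-likelihood of a variant sequence with that of its reference. This is an assumption about the general technique, not LOL-EVE's actual scoring rule, which the abstract does not specify. A toy order-k Markov model stands in for the transformer and does no conditioning on genetic context, and all names (`KmerAutoregressiveModel`, `apply_indel`, `indel_score`) are hypothetical.

```python
import math
from collections import defaultdict


class KmerAutoregressiveModel:
    """Toy order-k Markov model over DNA.

    A stand-in for an autoregressive genomic language model: it assigns
    each base a probability given the preceding `order` bases.
    """

    def __init__(self, order=2, alphabet="ACGT"):
        self.order = order
        self.alphabet = alphabet
        # counts[context][next_base] = number of times observed in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            padded = "^" * self.order + seq  # "^" marks the sequence start
            for i in range(self.order, len(padded)):
                context = padded[i - self.order:i]
                self.counts[context][padded[i]] += 1
        return self

    def log_likelihood(self, seq):
        """Sum of log P(base | previous bases), with add-one smoothing."""
        padded = "^" * self.order + seq
        total = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            ctx_counts = self.counts.get(context, {})
            numerator = ctx_counts.get(padded[i], 0) + 1
            denominator = sum(ctx_counts.values()) + len(self.alphabet)
            total += math.log(numerator / denominator)
        return total


def apply_indel(reference, position, deleted="", inserted=""):
    """Replace `deleted` at `position` (0-based) with `inserted`."""
    if reference[position:position + len(deleted)] != deleted:
        raise ValueError("deleted allele does not match the reference")
    return reference[:position] + inserted + reference[position + len(deleted):]


def indel_score(model, reference, position, deleted="", inserted=""):
    """Log-likelihood of the variant minus log-likelihood of the reference.

    More negative values mean the model finds the variant less plausible
    than the reference. No variant-effect labels are used (zero-shot).
    """
    variant = apply_indel(reference, position, deleted, inserted)
    return model.log_likelihood(variant) - model.log_likelihood(reference)


if __name__ == "__main__":
    # Made-up training sequences, for illustration only.
    model = KmerAutoregressiveModel(order=2).fit(
        ["TATAAAGGCC", "TATAAAGGCG", "GGTATAAAGC"]
    )
    print(indel_score(model, "TATAAAGG", 2, deleted="TA"))
```

Because an indel changes sequence length, a raw log-likelihood difference is length-biased: a deletion removes terms from the sum and an insertion adds them. A practical scorer would need to account for this, for example by normalizing per base.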