Augmenting DNABERT Embeddings with Multimodal DNA Features for Improved Regulatory Sequence Interpretation
Proceedings of the 20th Machine Learning in Computational Biology meeting, PMLR 311:294-303, 2025.
Abstract
While DNABERT leverages k-mer embeddings to model genomic sequences, its exclusive reliance on nucleotide k-mers can limit its effectiveness in capturing regulatory elements that lack distinct motif signals or display subtle, compositionally diffuse patterns. In this study, we explore a multimodal approach that augments DNABERT embeddings with DNA-intrinsic features, including nucleotide composition, purine-pyrimidine balance, CpG density, and structural properties such as minor groove width and electrostatic potential. These physicochemical and sequence-derived features offer complementary information about DNA shape and stability, which is often critical in regulatory regions such as non-TATA promoters and certain transcription factor binding sites (TFBS). By integrating these features with DNABERT representations, we show improved model performance, both in overall prediction accuracy and in the ability to interpret pattern-depleted regulatory sequences. We applied this framework, DNABERT-CoreProm-MM (Core Promoter model with MultiModalities), to the task of promoter prediction, with a focus on both TATA and non-TATA core promoter sequences. Our results demonstrate that the DNABERT-CoreProm-MM model improves prediction accuracy by 3.72% for TATA promoters and 22.56% for non-TATA promoters. These findings highlight the value of sequence-intrinsic and shape features as additional modalities for enhancing the interpretability and accuracy of transformer-based models, particularly for genomic sequences lacking strong motif structure. This approach offers a more comprehensive and biologically informed framework for modeling DNA regulatory elements.
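To make the sequence-derived features named in the abstract concrete, the following is a minimal sketch of how nucleotide composition, purine-pyrimidine balance, and CpG density could be computed and appended to a sequence embedding. It is not the authors' implementation: the function names, the feature definitions (for example, CpG density as CG dinucleotides per dinucleotide position), and fusion by simple concatenation are all assumptions made for illustration. Structural features such as minor groove width and electrostatic potential are omitted because they require external shape-prediction tools or lookup tables.

```python
from typing import Dict, List, Sequence

# Hypothetical, fixed ordering so every fused vector has the same layout.
FEATURE_ORDER = [
    "freq_A", "freq_C", "freq_G", "freq_T",
    "purine_fraction", "gc_content", "cpg_density",
]


def intrinsic_features(seq: str) -> Dict[str, float]:
    """Compute simple sequence-derived features for one DNA sequence.

    The definitions here are illustrative assumptions, not the paper's.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("sequence must be non-empty")

    counts = {base: s.count(base) for base in "ACGT"}

    features = {f"freq_{base}": counts[base] / n for base in "ACGT"}
    # Purine-pyrimidine balance expressed as the fraction of purines (A, G).
    features["purine_fraction"] = (counts["A"] + counts["G"]) / n
    features["gc_content"] = (counts["G"] + counts["C"]) / n
    # CpG density: CG dinucleotides divided by the number of dinucleotide positions.
    features["cpg_density"] = s.count("CG") / (n - 1) if n > 1 else 0.0
    return features


def fuse_with_embedding(
    embedding: Sequence[float], features: Dict[str, float]
) -> List[float]:
    """Concatenate an embedding vector with the handcrafted features.

    Concatenation is an assumed fusion strategy chosen for simplicity.
    """
    return list(embedding) + [features[name] for name in FEATURE_ORDER]


if __name__ == "__main__":
    feats = intrinsic_features("ACGTCGAA")
    # The two-element list stands in for a real DNABERT embedding.
    fused = fuse_with_embedding([0.1, 0.2], feats)
    print(feats)
    print(len(fused))
```

In practice the placeholder embedding would be replaced by a pooled DNABERT representation, and the fused vector would feed a downstream classifier; feature scaling before concatenation is likely to matter, since the embedding dimensions and the handcrafted features can live on different scales.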