Continued domain-specific pre-training of protein language models for pMHC-I binding prediction
Proceedings of the 20th Machine Learning in Computational Biology meeting, PMLR 311:304-325, 2025.
Abstract
Predicting peptide-major histocompatibility complex class I (pMHC-I) binding affinity remains challenging due to extreme allelic diversity (over 30,000 HLA alleles), severe data scarcity for most alleles, and noisy experimental measurements. Current methods particularly struggle with underrepresented alleles and quantitative binding prediction. We test whether domain-specific continued pre-training of protein language models (pLMs) is beneficial for their application to pMHC-I binding affinity prediction. Starting from ESM Cambrian, we perform masked-language-modeling-based continued pre-training on HLA-associated peptides. We then fine-tune for functional IC50 binding affinity prediction using only high-quality quantitative data, avoiding the mass spectrometry biases inherited by existing methods. After continued pre-training and fine-tuning, our resulting model (ESMCBA) achieves a median Spearman correlation of 0.61 for predicting binding affinity across 24 common HLA alleles, outperforming other state-of-the-art predictors. Continued pre-training provides consistent gains over models fine-tuned directly without it, improving correlations by 0.08, particularly for alleles with less training data; the gains diminish as training data grows beyond 3,000 peptides, where pre-trained and non-pre-trained models converge to similar performance (0.57). However, the method requires substantial computational resources, and performance remains fundamentally limited by the inherent noise and experimental heterogeneity of binding affinity measurements from diverse assay protocols. This work has important potential applications to neoantigen vaccine prioritization and provides a framework for improving pLM performance on specialized tasks through domain-specific continued pre-training.
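For concreteness, the two-stage recipe described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' released code: a public ESM-2 checkpoint stands in for ESM Cambrian, the peptides, affinity targets, and hyperparameters are placeholders, and the 1 - log(IC50)/log(50000) transformation is one common (assumed) choice of regression target.

```python
# Sketch of: (1) masked-language-model continued pre-training on
# HLA-associated peptides, then (2) fine-tuning a regression head on
# quantitative IC50 binding-affinity data. Checkpoint, data, and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from scipy.stats import spearmanr
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

CHECKPOINT = "facebook/esm2_t12_35M_UR50D"  # stand-in for ESM Cambrian
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
mlm_model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT)

# --- Stage 1: continued pre-training (MLM) on HLA-associated peptides ---
peptides = ["SIINFEKLV", "GILGFVFTL", "NLVPMVATV"]  # toy examples
enc = tokenizer(peptides, padding=True, return_tensors="pt")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
batch = collator([{k: v[i] for k, v in enc.items()}
                  for i in range(len(peptides))])

opt = torch.optim.AdamW(mlm_model.parameters(), lr=1e-5)
loss = mlm_model(**batch).loss  # cross-entropy over masked residues
loss.backward()
opt.step()
opt.zero_grad()

# --- Stage 2: fine-tune for IC50 binding-affinity regression ---
class BindingAffinityHead(nn.Module):
    """Masked mean-pooled embeddings -> scalar affinity,
    e.g. 1 - log(IC50)/log(50000)."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.regressor = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1)  # ignore padding positions
        return self.regressor(pooled).squeeze(-1)

model = BindingAffinityHead(mlm_model.base_model, mlm_model.config.hidden_size)
targets = torch.tensor([0.8, 0.9, 0.7])  # toy normalized affinities
pred = model(enc["input_ids"], enc["attention_mask"])
mse = nn.functional.mse_loss(pred, targets)  # fine-tuning objective

# Evaluation as in the abstract: Spearman correlation per allele
rho, _ = spearmanr(pred.detach().numpy(), targets.numpy())
```

In practice the two stages would each run for many epochs over allele-specific datasets, with the Spearman correlation computed per allele on held-out peptides; the snippet only shows the shape of the pipeline.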