Continued domain-specific pre-training of protein language models for pMHC-I binding prediction

Sergio Mares, Ariel Espinoza, Nilah Ioannidis
Proceedings of the 20th Machine Learning in Computational Biology meeting, PMLR 311:304-325, 2025.

Abstract

Predicting peptide-major histocompatibility complex I (pMHC-I) binding affinity remains challenging due to extreme allelic diversity (approximately 30,000 HLA alleles), severe data scarcity for most alleles, and noisy experimental measurements. Current methods particularly struggle with underrepresented alleles and quantitative binding prediction. We test whether domain-specific continued pre-training of protein language models (pLMs) is beneficial for their application to pMHC-I binding affinity prediction. Starting from ESM Cambrian, we perform continued pre-training with masked language modeling on HLA-associated peptides. We then fine-tune for functional IC50 binding affinity prediction using only high-quality quantitative data, avoiding the mass spectrometry biases inherited by existing methods. After continued pre-training and fine-tuning, our resulting model (ESMCBA) achieves a median Spearman correlation of 0.61 for predicting binding affinity across 24 common HLA alleles, outperforming other state-of-the-art predictors. Continued pre-training provides consistent gains relative to models that are directly fine-tuned without the continued pre-training step, particularly for alleles with fewer training examples, improving correlations by approximately 0.08 over these baselines, with diminishing returns as training data increases beyond 3,000 peptides, where pre-trained and non-pre-trained models converge to similar performance (Spearman correlation of approximately 0.57). Additionally, the method requires substantial computational resources, and performance remains fundamentally limited by the inherent noise and experimental heterogeneity of binding affinity measurements from diverse assay protocols. This work has important potential applications to neoantigen vaccine prioritization and provides a framework for improving pLM performance on specialized tasks through domain-specific continued pre-training.
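The pipeline the abstract describes (continued masked-language-model pre-training on HLA-associated peptides, followed by fine-tuning on quantitative IC50 data and evaluation by Spearman correlation) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' ESMCBA code: it substitutes a public ESM-2 checkpoint (via Hugging Face transformers) for ESM Cambrian, uses toy peptides and IC50 values, and assumes the common 1 - log(IC50)/log(50,000) affinity transform.

# Minimal sketch (not the authors' implementation): continued masked-LM
# pre-training on MHC-associated peptides, then a regression head for IC50.
# The paper starts from ESM Cambrian; here the public ESM-2 checkpoint from
# Hugging Face transformers is used as a stand-in, with toy peptides/affinities.
import torch
from torch import nn
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

MODEL_NAME = "facebook/esm2_t12_35M_UR50D"  # stand-in for ESM Cambrian
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
mlm_model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

peptides = ["SLYNTVATL", "GILGFVFTL", "NLVPMVATV"]  # toy HLA-associated 9-mers

def mlm_loss(batch, mask_prob=0.15):
    """One continued pre-training step: mask ~15% of residues and predict them."""
    enc = tokenizer(batch, return_tensors="pt", padding=True)
    labels = enc["input_ids"].clone()
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()], dtype=torch.bool)
    mask = (torch.rand(labels.shape) < mask_prob) & ~special
    labels[~mask] = -100                           # compute loss on masked positions only
    enc["input_ids"][mask] = tokenizer.mask_token_id
    return mlm_model(**enc, labels=labels).loss

opt = torch.optim.AdamW(mlm_model.parameters(), lr=1e-5)
loss = mlm_loss(peptides)
loss.backward()
opt.step()

# --- Fine-tuning sketch: mean-pooled embeddings + linear head for binding affinity ---
encoder = AutoModel.from_pretrained(MODEL_NAME)    # in practice, reuse the MLM-adapted weights
head = nn.Linear(encoder.config.hidden_size, 1)

def predict_affinity(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True)
    hidden = encoder(**enc).last_hidden_state                # (batch, length, hidden)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1)             # mean-pool over residues
    return head(pooled).squeeze(-1)

# Assumed IEDB/NetMHC-style transform: y = 1 - log(IC50 nM)/log(50000); higher = stronger binder
ic50_nM = torch.tensor([32.0, 110.0, 5400.0])                 # toy measurements
targets = 1.0 - torch.log(ic50_nM) / torch.log(torch.tensor(50000.0))

preds = predict_affinity(peptides)
reg_loss = nn.functional.mse_loss(preds, targets)             # regression objective
rho, _ = spearmanr(preds.detach().numpy(), targets.numpy())   # evaluation metric used in the paper

In the workflow the abstract describes, the fine-tuning encoder would be initialized from the continually pre-trained weights rather than the base checkpoint, and trained on curated quantitative binding measurements instead of these placeholder values.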

Cite this Paper


BibTeX
@InProceedings{pmlr-v311-mares25a,
  title     = {Continued domain-specific pre-training of protein language models for pMHC-I binding prediction},
  author    = {Mares, Sergio and Espinoza, Ariel and Ioannidis, Nilah},
  booktitle = {Proceedings of the 20th Machine Learning in Computational Biology meeting},
  pages     = {304--325},
  year      = {2025},
  editor    = {Knowles, David A and Koo, Peter K},
  volume    = {311},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--11 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v311/main/assets/mares25a/mares25a.pdf},
  url       = {https://proceedings.mlr.press/v311/mares25a.html}
}
Endnote
%0 Conference Paper
%T Continued domain-specific pre-training of protein language models for pMHC-I binding prediction
%A Sergio Mares
%A Ariel Espinoza
%A Nilah Ioannidis
%B Proceedings of the 20th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2025
%E David A Knowles
%E Peter K Koo
%F pmlr-v311-mares25a
%I PMLR
%P 304--325
%U https://proceedings.mlr.press/v311/mares25a.html
%V 311
APA
Mares, S., Espinoza, A., & Ioannidis, N. (2025). Continued domain-specific pre-training of protein language models for pMHC-I binding prediction. Proceedings of the 20th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 311:304-325. Available from https://proceedings.mlr.press/v311/mares25a.html.