Somatic Hypermutation Informed Vocabulary Encoder Representations

Chiho Im, Artem Mikelov, Ryan Zhao, Anshul Kundaje, Scott Boyd
Proceedings of the 20th Machine Learning in Computational Biology meeting, PMLR 311:240-250, 2025.

Abstract

Somatic hypermutations (SHMs) acquired during affinity maturation of memory B cell receptors (mBCRs) carry important immunological signals, but remain challenging for protein language models (PLMs) to capture effectively. We introduce SHIVER, a mutation-aware antibody language model that treats each amino acid substitution as a distinct token, allowing the model to directly encode the context-dependent impact of SHMs. Trained on paired heavy and light chain sequences from human mBCR repertoires, SHIVER incorporates a tailored vocabulary, a subsampling strategy for data augmentation, and a mutation-focused masking scheme to better model the dynamics of affinity maturation. We evaluate SHIVER on the task of predicting mBCR binding to influenza antigens and find that it outperforms both general and antibody-specific PLMs using a simple logistic head. Our results suggest that explicitly modeling SHMs improves biological relevance and generalization of learned representations.
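The abstract's core idea, a vocabulary in which each amino acid substitution is its own token, can be illustrated with a minimal sketch. This is not the authors' code; the function name, the pre-aligned-sequence assumption, and the `germ>obs` token format are illustrative choices, not details from the paper.

```python
def mutation_aware_tokens(germline: str, observed: str) -> list[str]:
    """Tokenize an aligned observed sequence against its inferred germline.

    Unmutated positions emit the plain amino acid; mutated positions emit
    a combined substitution token such as 'S>T', giving the model a direct
    handle on each somatic hypermutation.
    """
    assert len(germline) == len(observed), "sequences must be pre-aligned"
    return [
        obs if obs == germ else f"{germ}>{obs}"
        for germ, obs in zip(germline, observed)
    ]

# Example: two substitutions relative to the germline.
print(mutation_aware_tokens("CASSL", "CATSV"))
# ['C', 'A', 'S>T', 'S', 'L>V']
```

A mutation-focused masking scheme, as described in the abstract, could then preferentially select the substitution tokens (those containing `>`) as masking targets during pretraining.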

Cite this Paper


BibTeX
@InProceedings{pmlr-v311-im25a,
  title     = {Somatic Hypermutation Informed Vocabulary Encoder Representations},
  author    = {Im, Chiho and Mikelov, Artem and Zhao, Ryan and Kundaje, Anshul and Boyd, Scott},
  booktitle = {Proceedings of the 20th Machine Learning in Computational Biology meeting},
  pages     = {240--250},
  year      = {2025},
  editor    = {Knowles, David A and Koo, Peter K},
  volume    = {311},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--11 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v311/main/assets/im25a/im25a.pdf},
  url       = {https://proceedings.mlr.press/v311/im25a.html},
  abstract  = {Somatic hypermutations (SHMs) acquired during affinity maturation of memory B cell receptors (mBCRs) carry important immunological signals, but remain challenging for protein language models (PLMs) to capture effectively. We introduce SHIVER, a mutation-aware antibody language model that treats each amino acid substitution as a distinct token, allowing the model to directly encode the context-dependent impact of SHMs. Trained on paired heavy and light chain sequences from human mBCR repertoires, SHIVER incorporates a tailored vocabulary, a subsampling strategy for data augmentation, and a mutation-focused masking scheme to better model the dynamics of affinity maturation. We evaluate SHIVER on the task of predicting mBCR binding to influenza antigens and find that it outperforms both general and antibody-specific PLMs using a simple logistic head. Our results suggest that explicitly modeling SHMs improves biological relevance and generalization of learned representations.}
}
Endnote
%0 Conference Paper
%T Somatic Hypermutation Informed Vocabulary Encoder Representations
%A Chiho Im
%A Artem Mikelov
%A Ryan Zhao
%A Anshul Kundaje
%A Scott Boyd
%B Proceedings of the 20th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2025
%E David A Knowles
%E Peter K Koo
%F pmlr-v311-im25a
%I PMLR
%P 240--250
%U https://proceedings.mlr.press/v311/im25a.html
%V 311
%X Somatic hypermutations (SHMs) acquired during affinity maturation of memory B cell receptors (mBCRs) carry important immunological signals, but remain challenging for protein language models (PLMs) to capture effectively. We introduce SHIVER, a mutation-aware antibody language model that treats each amino acid substitution as a distinct token, allowing the model to directly encode the context-dependent impact of SHMs. Trained on paired heavy and light chain sequences from human mBCR repertoires, SHIVER incorporates a tailored vocabulary, a subsampling strategy for data augmentation, and a mutation-focused masking scheme to better model the dynamics of affinity maturation. We evaluate SHIVER on the task of predicting mBCR binding to influenza antigens and find that it outperforms both general and antibody-specific PLMs using a simple logistic head. Our results suggest that explicitly modeling SHMs improves biological relevance and generalization of learned representations.
APA
Im, C., Mikelov, A., Zhao, R., Kundaje, A. & Boyd, S. (2025). Somatic Hypermutation Informed Vocabulary Encoder Representations. Proceedings of the 20th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 311:240-250. Available from https://proceedings.mlr.press/v311/im25a.html.