ImmSET: Sequence-Based Predictor of TCR-pMHC Specificity at Scale

Marco Garcia Noceda, Matthew T. Noakes, Andrew FigPope, Daniel E. Mattox, Bryan Howie, Harlan Robins
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1047-1074, 2026.

Abstract

T cells are a critical component of the adaptive immune system, playing a role in infectious disease, autoimmunity, and cancer. T cell function is mediated by the T cell receptor (TCR) protein, a highly diverse receptor targeting specific peptides presented by the major histocompatibility complex (pMHCs). Predicting the specificity of TCRs for their cognate pMHCs is central to understanding adaptive immunity and enabling personalized therapies. However, accurate prediction of this protein–protein interaction remains challenging due to the extreme diversity of both TCRs and pMHCs. Here, we present ImmSET (Immune Synapse Encoding Transformer), a novel sequence-based architecture designed to model interactions among sets of variable-length biological sequences. We train this model across a range of dataset sizes and compositions and study the resulting models’ generalization to pMHC targets. We describe a failure mode in prior sequence-based approaches that inflates previously reported performance on this task and show that ImmSET remains robust under stricter evaluation. In systematically testing the scaling behavior of ImmSET with training data, we show that performance scales consistently with data volume across multiple data types and compares favorably with the pre-trained protein language model ESM2 fine-tuned on the same datasets. Finally, we demonstrate that ImmSET can outperform AlphaFold2 and AlphaFold3-based pipelines on TCR-pMHC specificity prediction when provided sufficient training data. This work establishes ImmSET as a scalable modeling paradigm for multi-sequence interaction problems, demonstrated in the TCR-pMHC setting but generalizable to other biological domains where high-throughput sequence-driven reasoning complements structure prediction and experimental mapping.

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-noceda26a,
  title = {{ImmSET}: Sequence-Based Predictor of {TCR}-{pMHC} Specificity at Scale},
  author = {Noceda, Marco Garcia and Noakes, Matthew T. and FigPope, Andrew and Mattox, Daniel E. and Howie, Bryan and Robins, Harlan},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages = {1047--1074},
  year = {2026},
  editor = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume = {297},
  series = {Proceedings of Machine Learning Research},
  month = {13--14 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/noceda26a/noceda26a.pdf},
  url = {https://proceedings.mlr.press/v297/noceda26a.html},
  abstract = {T cells are a critical component of the adaptive immune system, playing a role in infectious disease, autoimmunity, and cancer. T cell function is mediated by the T cell receptor ({TCR}) protein, a highly diverse receptor targeting specific peptides presented by the major histocompatibility complex ({pMHC}s). Predicting the specificity of {TCR}s for their cognate {pMHC}s is central to understanding adaptive immunity and enabling personalized therapies. However, accurate prediction of this protein–protein interaction remains challenging due to the extreme diversity of both {TCR}s and {pMHC}s. Here, we present {ImmSET} (Immune Synapse Encoding Transformer), a novel sequence-based architecture designed to model interactions among sets of variable-length biological sequences. We train this model across a range of dataset sizes and compositions and study the resulting models’ generalization to {pMHC} targets. We describe a failure mode in prior sequence-based approaches that inflates previously reported performance on this task and show that {ImmSET} remains robust under stricter evaluation. In systematically testing the scaling behavior of {ImmSET} with training data, we show that performance scales consistently with data volume across multiple data types and compares favorably with the pre-trained protein language model {ESM2} fine-tuned on the same datasets. Finally, we demonstrate that {ImmSET} can outperform AlphaFold2 and AlphaFold3-based pipelines on {TCR}-{pMHC} specificity prediction when provided sufficient training data. This work establishes {ImmSET} as a scalable modeling paradigm for multi-sequence interaction problems, demonstrated in the {TCR}-{pMHC} setting but generalizable to other biological domains where high-throughput sequence-driven reasoning complements structure prediction and experimental mapping.}
}
Endnote
%0 Conference Paper
%T ImmSET: Sequence-Based Predictor of TCR-pMHC Specificity at Scale
%A Marco Garcia Noceda
%A Matthew T. Noakes
%A Andrew FigPope
%A Daniel E. Mattox
%A Bryan Howie
%A Harlan Robins
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-noceda26a
%I PMLR
%P 1047--1074
%U https://proceedings.mlr.press/v297/noceda26a.html
%V 297
%X T cells are a critical component of the adaptive immune system, playing a role in infectious disease, autoimmunity, and cancer. T cell function is mediated by the T cell receptor (TCR) protein, a highly diverse receptor targeting specific peptides presented by the major histocompatibility complex (pMHCs). Predicting the specificity of TCRs for their cognate pMHCs is central to understanding adaptive immunity and enabling personalized therapies. However, accurate prediction of this protein–protein interaction remains challenging due to the extreme diversity of both TCRs and pMHCs. Here, we present ImmSET (Immune Synapse Encoding Transformer), a novel sequence-based architecture designed to model interactions among sets of variable-length biological sequences. We train this model across a range of dataset sizes and compositions and study the resulting models’ generalization to pMHC targets. We describe a failure mode in prior sequence-based approaches that inflates previously reported performance on this task and show that ImmSET remains robust under stricter evaluation. In systematically testing the scaling behavior of ImmSET with training data, we show that performance scales consistently with data volume across multiple data types and compares favorably with the pre-trained protein language model ESM2 fine-tuned on the same datasets. Finally, we demonstrate that ImmSET can outperform AlphaFold2 and AlphaFold3-based pipelines on TCR-pMHC specificity prediction when provided sufficient training data. This work establishes ImmSET as a scalable modeling paradigm for multi-sequence interaction problems, demonstrated in the TCR-pMHC setting but generalizable to other biological domains where high-throughput sequence-driven reasoning complements structure prediction and experimental mapping.
APA
Noceda, M.G., Noakes, M.T., FigPope, A., Mattox, D.E., Howie, B. & Robins, H. (2026). ImmSET: Sequence-Based Predictor of TCR-pMHC Specificity at Scale. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1047-1074. Available from https://proceedings.mlr.press/v297/noceda26a.html.
