Towards functional annotation with latent protein language model features

Jacob Silberg, Elana Simon, James Zou
Proceedings of the 20th Machine Learning in Computational Biology meeting, PMLR 311:280-293, 2025.

Abstract

Protein Language Models (PLMs) create high-dimensional embeddings that can be transformed into interpretable sparse features using Sparse Autoencoders (SAEs), where each feature activates on specific protein elements or patterns. However, scalably identifying which features are cohesive and reliable enough for protein annotation remains challenging. We address this by developing a validation pipeline combining three complementary methods: (1) expanded database matching across 20+ annotation sources including hierarchical codes, (2) feature-guided local structural alignment to identify structurally consistent activation regions, and (3) LLM-based feature description generation. Our annotation pipeline demonstrates three key properties of SAE features that make them a useful source of functional annotation complementary to existing methods. First, they can represent more granular patterns than existing protein databases, enabling the identification of sub-domains. Second, they can detect missing annotations by finding proteins that display recognizable structural motifs but lack corresponding database labels. Here, we automatically identify at least 491 missing CATH topology annotations with our pipeline. Third, they can maintain structural consistency across unseen proteins. Of our 10,240 SAE features, we find 615 that are consistently structurally similar in unannotated metagenomic proteins, allowing us to structurally match at least 8,077 metagenomic proteins to characterized proteins. This provides a rapid annotation pipeline with constant-time search regardless of database size and automatically includes structural and functional information about the feature that triggered the match.

Cite this Paper


BibTeX
@InProceedings{pmlr-v311-silberg25a,
  title = {Towards functional annotation with latent protein language model features},
  author = {Silberg, Jacob and Simon, Elana and Zou, James},
  booktitle = {Proceedings of the 20th Machine Learning in Computational Biology meeting},
  pages = {280--293},
  year = {2025},
  editor = {Knowles, David A and Koo, Peter K},
  volume = {311},
  series = {Proceedings of Machine Learning Research},
  month = {10--11 Sep},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v311/main/assets/silberg25a/silberg25a.pdf},
  url = {https://proceedings.mlr.press/v311/silberg25a.html},
  abstract = {Protein Language Models (PLMs) create high-dimensional embeddings that can be transformed into interpretable sparse features using Sparse Autoencoders (SAEs), where each feature activates on specific protein elements or patterns. However, scalably identifying which features are cohesive and reliable enough for protein annotation remains challenging. We address this by developing a validation pipeline combining three complementary methods: (1) expanded database matching across 20+ annotation sources including hierarchical codes, (2) feature-guided local structural alignment to identify structurally consistent activation regions, and (3) LLM-based feature description generation. Our annotation pipeline demonstrates three key properties of SAE features that make them a useful source of functional annotation complementary to existing methods. First, they can represent more granular patterns than existing protein databases, enabling the identification of sub-domains. Second, they can detect missing annotations by finding proteins that display recognizable structural motifs but lack corresponding database labels. Here, we automatically identify at least 491 missing CATH topology annotations with our pipeline. Third, they can maintain structural consistency across unseen proteins. Of our 10,240 SAE features, we find 615 that are consistently structurally similar in unannotated metagenomic proteins, allowing us to structurally match at least 8,077 metagenomic proteins to characterized proteins. This provides a rapid annotation pipeline with constant-time search regardless of database size and automatically includes structural and functional information about the feature that triggered the match.}
}
Endnote
%0 Conference Paper
%T Towards functional annotation with latent protein language model features
%A Jacob Silberg
%A Elana Simon
%A James Zou
%B Proceedings of the 20th Machine Learning in Computational Biology meeting
%C Proceedings of Machine Learning Research
%D 2025
%E David A Knowles
%E Peter K Koo
%F pmlr-v311-silberg25a
%I PMLR
%P 280--293
%U https://proceedings.mlr.press/v311/silberg25a.html
%V 311
%X Protein Language Models (PLMs) create high-dimensional embeddings that can be transformed into interpretable sparse features using Sparse Autoencoders (SAEs), where each feature activates on specific protein elements or patterns. However, scalably identifying which features are cohesive and reliable enough for protein annotation remains challenging. We address this by developing a validation pipeline combining three complementary methods: (1) expanded database matching across 20+ annotation sources including hierarchical codes, (2) feature-guided local structural alignment to identify structurally consistent activation regions, and (3) LLM-based feature description generation. Our annotation pipeline demonstrates three key properties of SAE features that make them a useful source of functional annotation complementary to existing methods. First, they can represent more granular patterns than existing protein databases, enabling the identification of sub-domains. Second, they can detect missing annotations by finding proteins that display recognizable structural motifs but lack corresponding database labels. Here, we automatically identify at least 491 missing CATH topology annotations with our pipeline. Third, they can maintain structural consistency across unseen proteins. Of our 10,240 SAE features, we find 615 that are consistently structurally similar in unannotated metagenomic proteins, allowing us to structurally match at least 8,077 metagenomic proteins to characterized proteins. This provides a rapid annotation pipeline with constant-time search regardless of database size and automatically includes structural and functional information about the feature that triggered the match.
APA
Silberg, J., Simon, E. & Zou, J. (2025). Towards functional annotation with latent protein language model features. Proceedings of the 20th Machine Learning in Computational Biology meeting, in Proceedings of Machine Learning Research 311:280-293. Available from https://proceedings.mlr.press/v311/silberg25a.html.
