Superposition disentanglement of neural representations reveals hidden alignment

André Longon, David Klindt, Meenakshi Khosla
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:342-359, 2026.

Abstract

The superposition hypothesis states that single neurons may participate in representing multiple features in order for the neural network to represent more features than it has neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: does superposition interact with alignment metrics in any undesirable way? We hypothesize that models which represent the same features in different superposition arrangements, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We develop a theory for how permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model’s base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN-DNN and DNN-brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural networks.
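
The pipeline the abstract describes is straightforward to sketch. Below is a minimal, self-contained illustration (not the authors' code) in Python with NumPy and PyTorch: two toy models embed the same 8 sparse features into 4 neurons through different random mixing matrices (different superposition arrangements), a small overcomplete SAE with an L1 sparsity penalty is trained on each model's activations, and a semi-matching correlation score is computed on the base neurons versus the SAE latents. The helper names (semi_match_score, train_sae), the L1 coefficient, and the toy dimensions are all illustrative assumptions, not details from the paper.

# Minimal sketch (assumptions noted above, not the authors' code): compare a
# matching-based alignment score on base neurons vs. SAE latent codes.
import numpy as np
import torch
import torch.nn as nn

def semi_match_score(X, Y):
    """Semi-matching: match each unit in X to its best-correlated unit in Y
    (one-sided matching) and average the best correlations."""
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-8)
    Yc = (Y - Y.mean(0)) / (Y.std(0) + 1e-8)
    corr = (Xc.T @ Yc) / len(X)       # pairwise Pearson correlations
    return corr.max(axis=1).mean()    # best match per X-unit

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: ReLU latents trained with an L1 sparsity penalty."""
    def __init__(self, n_neurons, n_latents):
        super().__init__()
        self.enc = nn.Linear(n_neurons, n_latents)
        self.dec = nn.Linear(n_latents, n_neurons)

    def forward(self, x):
        z = torch.relu(self.enc(x))
        return self.dec(z), z

def train_sae(acts, n_latents, l1=1e-3, epochs=200, lr=1e-3):
    """Train an SAE on activations and return its sparse latent codes."""
    sae = SparseAutoencoder(acts.shape[1], n_latents)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    x = torch.tensor(acts, dtype=torch.float32)
    for _ in range(epochs):
        recon, z = sae(x)
        loss = ((recon - x) ** 2).mean() + l1 * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.relu(sae.enc(x)).numpy()

# Toy setup: both "models" carry the same 8 sparse features, but mix them
# into 4 neurons with different random superposition arrangements.
rng = np.random.default_rng(0)
feats = rng.exponential(1.0, (2000, 8)) * (rng.random((2000, 8)) < 0.1)
W_a, W_b = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
acts_a, acts_b = feats @ W_a, feats @ W_b

base = semi_match_score(acts_a, acts_b)
lat_a = train_sae(acts_a, n_latents=16)
lat_b = train_sae(acts_b, n_latents=16)
disentangled = semi_match_score(lat_a, lat_b)
print(f"base-neuron alignment: {base:.3f}, SAE-latent alignment: {disentangled:.3f}")

Because the two mixing matrices differ, the base-neuron score typically understates the shared feature structure; scoring in the SAE latent space is what the abstract reports as recovering the higher, "hidden" alignment.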

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-longon26a,
  title     = {Superposition disentanglement of neural representations reveals hidden alignment},
  author    = {Longon, Andr\'{e} and Klindt, David and Khosla, Meenakshi},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {342--359},
  year      = {2026},
  editor    = {Fumero, Marco and Domine, Clementine and L{\"a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume    = {322},
  series    = {Proceedings of Machine Learning Research},
  month     = {06 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/longon26a/longon26a.pdf},
  url       = {https://proceedings.mlr.press/v322/longon26a.html},
  abstract  = {The superposition hypothesis states that single neurons may participate in representing multiple features in order for the neural network to represent more features than it has neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: does superposition interact with alignment metrics in any undesirable way? We hypothesize that models which represent the same features in different superposition arrangements, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We develop a theory for how permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model’s base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN-DNN and DNN-brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural networks.}
}
Endnote
%0 Conference Paper
%T Superposition disentanglement of neural representations reveals hidden alignment
%A André Longon
%A David Klindt
%A Meenakshi Khosla
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-longon26a
%I PMLR
%P 342--359
%U https://proceedings.mlr.press/v322/longon26a.html
%V 322
%X The superposition hypothesis states that single neurons may participate in representing multiple features in order for the neural network to represent more features than it has neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: does superposition interact with alignment metrics in any undesirable way? We hypothesize that models which represent the same features in different superposition arrangements, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We develop a theory for how permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model’s base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN-DNN and DNN-brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural networks.
APA
Longon, A., Klindt, D. & Khosla, M. (2026). Superposition disentanglement of neural representations reveals hidden alignment. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:342-359. Available from https://proceedings.mlr.press/v322/longon26a.html.