Neural Attention Maps Alignment in Vision Transformers and Mammalian Visual Cortex
Proceedings of the First Workshop on NeuroAI Multimodal Intelligence @ AAAI 2026, PMLR 308:163-179, 2026.
Abstract
Image parsing with Vision Transformers has achieved state-of-the-art results, but how these models process visual information compared with biological vision systems remains an open question. In this study, we present an extensive benchmark comparing the attention mechanisms of Vision Transformer-based models, such as Segment Anything and several of its variants, which capture long-range dependencies when learning generalized features of natural images, with neural responses recorded from the mouse visual cortex for the same visual inputs. We found a significant correspondence between the self-attention and convolutional maps in these models and neural activity in the mouse visual cortex. This trend is consistent across similar model architectures with varying numbers of parameters and provides an explainable trade-off between accuracy and efficiency on real-world object segmentation datasets. The relationship generalizes across sub-regions and neuronal genotypes, capturing diverse functional units in the mouse visual cortex. Our work is a pioneering effort to identify important parallels between hierarchical representation learning in vision transformers and the biological visual cortex. These neural correlates suggest that aspects of cortical computation captured by state-of-the-art vision models may contribute to their effectiveness on image understanding tasks and could guide the design of novel model architectures, advancing the development of neuro-AI models. We anticipate that this approach will also lead to future interpretability work to better understand the encoding and decoding principles of computation in the mammalian visual cortex.