Neural Attention Maps Alignment in Vision Transformers and Mammalian Visual Cortex

Hamd Jalil, Ahmed Rashid Qazi, Asim Iqbal
Proceedings of the First Workshop on NeuroAI Multimodal Intelligence @ AAAI 2026, PMLR 308:163-179, 2026.

Abstract

Image parsing with Vision Transformers has achieved state-of-the-art results, but how these models process visual information compared with biological vision systems remains an open question. In this study, we present an extensive benchmark comparing the attention mechanisms of Vision Transformer-based models, such as Segment Anything and several of its variants, which capture long-range dependencies when extracting generalized features from natural images, against neural responses recorded from the mouse visual cortex for the same visual inputs. We found a significant correspondence between the self-attention and convolutional maps in these models and neural activity in the mouse visual cortex. This trend is consistent across similar model architectures with varying numbers of parameters and provides an explainable trade-off between accuracy and efficiency on real-world object segmentation datasets. The relationship generalizes across sub-regions and neuronal genotypes, capturing diverse functional units in the mouse visual cortex. Our work presents a pioneering effort in identifying important parallels between hierarchical representational learning in vision-based transformers and the biological visual cortex. These neural correlates suggest that aspects of cortical computation captured by state-of-the-art vision models may contribute to their effectiveness on image understanding tasks and could guide the design of novel model architectures, advancing the development of neuro-AI models. We anticipate that this approach will also motivate future interpretability work aimed at better understanding the encoding and decoding principles of computation in the mammalian visual cortex.

Cite this Paper


BibTeX
@InProceedings{pmlr-v308-jalil26a,
  title = {Neural Attention Maps Alignment in Vision Transformers and Mammalian Visual Cortex},
  author = {Jalil, Hamd and Qazi, Ahmed Rashid and Iqbal, Asim},
  booktitle = {Proceedings of the First Workshop on NeuroAI Multimodal Intelligence @ AAAI 2026},
  pages = {163--179},
  year = {2026},
  editor = {Abbasi-Asl, Reza and Iqbal, Asim and Ito, Shinya and Arkhipov, Anton and Sanborn, Sophia},
  volume = {308},
  series = {Proceedings of Machine Learning Research},
  month = {27 Jan},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v308/main/assets/jalil26a/jalil26a.pdf},
  url = {https://proceedings.mlr.press/v308/jalil26a.html},
  abstract = {Image parsing with Vision Transformers has achieved state-of-the-art results, but how these models process visual information compared to biological vision systems is an open question. In this study, we present an extensive benchmarking between the attention mechanisms in the Vision Transformer-based models, such as Segment Anything, and its several variants that capture long-range dependencies in understanding the generalized features in natural images, with the neural responses captured from the mouse visual cortex for the same visual inputs. We found a significant correspondence between self-attention and convolutional maps in these models and cortical neural activity in the mouse visual cortex. This trend is observed to be consistent across similar model architectures with varying numbers of parameter units and provides an explainable trade-off between the accuracy and efficiency on real-world object segmentation datasets. This relationship is observed to be generalized across the sub-regions and neuronal genotypes, capturing diverse functional units in the mouse visual cortex. Our work proposes a pioneering effort in identifying important parallels between hierarchical representational learning in vision-based transformers and the biological visual cortex. To advance the development of neuro-AI models, these neural correlates suggest that aspects of cortical computation, captured by the state-of-the-art vision models, can potentially contribute to their effectiveness for image understanding tasks as well as guiding the advancement of novel model architecture design. We anticipate that this practice will also lead to future interpretability work to better understand the encoding and decoding principles of computation in the mammalian visual cortex.}
}
Endnote
%0 Conference Paper
%T Neural Attention Maps Alignment in Vision Transformers and Mammalian Visual Cortex
%A Hamd Jalil
%A Ahmed Rashid Qazi
%A Asim Iqbal
%B Proceedings of the First Workshop on NeuroAI Multimodal Intelligence @ AAAI 2026
%C Proceedings of Machine Learning Research
%D 2026
%E Reza Abbasi-Asl
%E Asim Iqbal
%E Shinya Ito
%E Anton Arkhipov
%E Sophia Sanborn
%F pmlr-v308-jalil26a
%I PMLR
%P 163--179
%U https://proceedings.mlr.press/v308/jalil26a.html
%V 308
%X Image parsing with Vision Transformers has achieved state-of-the-art results, but how these models process visual information compared to biological vision systems is an open question. In this study, we present an extensive benchmarking between the attention mechanisms in the Vision Transformer-based models, such as Segment Anything, and its several variants that capture long-range dependencies in understanding the generalized features in natural images, with the neural responses captured from the mouse visual cortex for the same visual inputs. We found a significant correspondence between self-attention and convolutional maps in these models and cortical neural activity in the mouse visual cortex. This trend is observed to be consistent across similar model architectures with varying numbers of parameter units and provides an explainable trade-off between the accuracy and efficiency on real-world object segmentation datasets. This relationship is observed to be generalized across the sub-regions and neuronal genotypes, capturing diverse functional units in the mouse visual cortex. Our work proposes a pioneering effort in identifying important parallels between hierarchical representational learning in vision-based transformers and the biological visual cortex. To advance the development of neuro-AI models, these neural correlates suggest that aspects of cortical computation, captured by the state-of-the-art vision models, can potentially contribute to their effectiveness for image understanding tasks as well as guiding the advancement of novel model architecture design. We anticipate that this practice will also lead to future interpretability work to better understand the encoding and decoding principles of computation in the mammalian visual cortex.
APA
Jalil, H., Qazi, A.R. & Iqbal, A. (2026). Neural Attention Maps Alignment in Vision Transformers and Mammalian Visual Cortex. Proceedings of the First Workshop on NeuroAI Multimodal Intelligence @ AAAI 2026, in Proceedings of Machine Learning Research 308:163-179. Available from https://proceedings.mlr.press/v308/jalil26a.html.
