Vision and language representations in multimodal AI models and human social brain regions during natural movie viewing

Hannah Small, Haemy Lee Masson, Stewart Mostofsky, Leyla Isik
Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models, PMLR 285:69-84, 2024.

Abstract

Recent work in NeuroAI suggests that representations in modern AI vision and language models are highly aligned with each other and with human visual cortex. In addition, training AI vision models on language-aligned tasks (e.g., CLIP-style models) improves their match to visual cortex, particularly in regions involved in social perception, suggesting these brain regions may be similarly "language aligned". This prior work has primarily investigated static stimuli without language, but in our daily lives, we simultaneously experience the dynamic visual world and communicate about it using language. To understand the processing of vision and language during natural viewing, we fit an encoding model to predict voxel-wise responses to an audiovisual movie using visual representations from both purely visual and language-aligned vision transformer models and paired language transformers. We first find that, in naturalistic settings, there is remarkably low correlation between representations in vision and language models, yet both predict social perceptual and language regions well. Next, we find that language alignment does not improve a vision model embedding’s match to neural responses in social perceptual regions, despite these regions being well predicted by both vision and language embeddings. Preliminary analyses, however, suggest that vision alignment does improve a language model’s ability to match neural responses in language regions during audiovisual processing. Our work demonstrates the importance of testing multimodal AI models in naturalistic settings and reveals differences between language alignment in modern AI models and the human brain.
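
A minimal sketch of the analysis style described in the abstract, assuming synthetic data: it correlates the representational geometry of hypothetical vision and language embeddings and fits a voxel-wise ridge-regression encoding model from embeddings to (here, random) BOLD responses, scored by held-out voxel correlations. All shapes, names (vision_emb, language_emb, bold), and the random inputs are illustrative assumptions, not the authors' pipeline or data.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trs, d_model, n_voxels = 600, 768, 1000  # movie time points, embedding dim, voxels (illustrative)

# Hypothetical per-time-point embeddings from a vision transformer and a paired
# language transformer, plus voxel-wise BOLD responses (random stand-ins).
vision_emb = rng.standard_normal((n_trs, d_model))
language_emb = rng.standard_normal((n_trs, d_model))
bold = rng.standard_normal((n_trs, n_voxels))

# Representational comparison: correlate the two models' time-point-by-time-point
# dissimilarity structure (second-order similarity).
rdm_vision = pdist(vision_emb, metric="correlation")
rdm_language = pdist(language_emb, metric="correlation")
print("vision-language RDM correlation:", spearmanr(rdm_vision, rdm_language)[0])

# Voxel-wise encoding model: ridge regression from embeddings to BOLD responses,
# evaluated by the correlation between predicted and held-out responses per voxel.
X_train, X_test, y_train, y_test = train_test_split(
    vision_emb, bold, test_size=0.2, shuffle=False)
encoder = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_train, y_train)
y_pred = encoder.predict(X_test)
voxel_r = [np.corrcoef(y_pred[:, v], y_test[:, v])[0, 1] for v in range(n_voxels)]
print("median held-out voxel correlation:", np.median(voxel_r))

On real data, swapping the vision embeddings for the language embeddings (or a language-aligned variant such as CLIP's image encoder) in the same fit would correspond to the model comparisons the abstract describes.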

Cite this Paper


BibTeX
@InProceedings{pmlr-v285-small24a,
  title     = {Vision and language representations in multimodal {AI} models and human social brain regions during natural movie viewing},
  author    = {Small, Hannah and Masson, Haemy Lee and Mostofsky, Stewart and Isik, Leyla},
  booktitle = {Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {69--84},
  year      = {2024},
  editor    = {Fumero, Marco and Domine, Clementine and Lähner, Zorah and Crisostomi, Donato and Moschella, Luca and Stachenfeld, Kimberly},
  volume    = {285},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v285/main/assets/small24a/small24a.pdf},
  url       = {https://proceedings.mlr.press/v285/small24a.html}
}
Endnote
%0 Conference Paper
%T Vision and language representations in multimodal AI models and human social brain regions during natural movie viewing
%A Hannah Small
%A Haemy Lee Masson
%A Stewart Mostofsky
%A Leyla Isik
%B Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2024
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Donato Crisostomi
%E Luca Moschella
%E Kimberly Stachenfeld
%F pmlr-v285-small24a
%I PMLR
%P 69--84
%U https://proceedings.mlr.press/v285/small24a.html
%V 285
APA
Small, H., Masson, H.L., Mostofsky, S. & Isik, L. (2024). Vision and language representations in multimodal AI models and human social brain regions during natural movie viewing. Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 285:69-84. Available from https://proceedings.mlr.press/v285/small24a.html.
