Revealing Vision-Language Integration in the Brain with Multimodal Networks

Vighnesh Subramaniam; Colin Conwell; Christopher Wang; Gabriel Kreiman; Boris Katz; Ignacio Cases; Andrei Barbu

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:46868-46890, 2024.

Abstract

We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often have different architectures, number of parameters, and training sets (possibly obscuring those differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR), which keep all of these attributes the same aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites or 12.94%) and brain regions where multimodal integration seems to occur. Additionally, we find that among the variants of multimodal training techniques we assess, CLIP-style training is the best suited for downstream prediction of the neural activity in these sites.

Cite this Paper

BibTeX

@InProceedings{pmlr-v235-subramaniam24a,
  title     = {Revealing Vision-Language Integration in the Brain with Multimodal Networks},
  author    = {Subramaniam, Vighnesh and Conwell, Colin and Wang, Christopher and Kreiman, Gabriel and Katz, Boris and Cases, Ignacio and Barbu, Andrei},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {46868--46890},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/subramaniam24a/subramaniam24a.pdf},
  url       = {https://proceedings.mlr.press/v235/subramaniam24a.html},
  abstract  = {We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often have different architectures, number of parameters, and training sets (possibly obscuring those differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR), which keep all of these attributes the same aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites or 12.94%) and brain regions where multimodal integration seems to occur. Additionally, we find that among the variants of multimodal training techniques we assess, CLIP-style training is the best suited for downstream prediction of the neural activity in these sites.}
}

Endnote

%0 Conference Paper
%T Revealing Vision-Language Integration in the Brain with Multimodal Networks
%A Vighnesh Subramaniam
%A Colin Conwell
%A Christopher Wang
%A Gabriel Kreiman
%A Boris Katz
%A Ignacio Cases
%A Andrei Barbu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-subramaniam24a
%I PMLR
%P 46868--46890
%U https://proceedings.mlr.press/v235/subramaniam24a.html
%V 235
%X We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often have different architectures, number of parameters, and training sets (possibly obscuring those differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR), which keep all of these attributes the same aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites or 12.94%) and brain regions where multimodal integration seems to occur. Additionally, we find that among the variants of multimodal training techniques we assess, CLIP-style training is the best suited for downstream prediction of the neural activity in these sites.

APA

Subramaniam, V., Conwell, C., Wang, C., Kreiman, G., Katz, B., Cases, I., & Barbu, A. (2024). Revealing Vision-Language Integration in the Brain with Multimodal Networks. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:46868-46890. Available from https://proceedings.mlr.press/v235/subramaniam24a.html.
