Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion

Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, Cheston Tan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42213-42244, 2024.

Abstract

While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture the rich multimodal structures and dynamics of video and text? Or do they achieve high scores by exploiting biases and spurious features? To provide insights, we design QUAG (QUadrant AveraGe), a lightweight, non-parametric probe that performs combined dataset-model representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design QUAG-attention, a less-expressive replacement for self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations and without any finetuning. Our findings raise doubts about current models' ability to learn highly-coupled multimodal representations. Hence, we design CLAVI (Complements in LAnguage and VIdeo), a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, most of the models achieve near-trivial performance on CLAVI. This reasserts the limitation of current models in learning highly-coupled multimodal representations, a capability that current datasets do not evaluate.
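To make the quadrant-averaging idea concrete, below is a minimal PyTorch-style sketch of how such a probe could be applied to the post-softmax attention weights of a fused [video; text] token sequence. The function name, the choice of averaging after the softmax, and the block layout are illustrative assumptions based on the abstract, not the authors' reference implementation.

```python
import torch

def quag_average(attn: torch.Tensor, n_video: int,
                 quadrants=("video-text", "text-video")) -> torch.Tensor:
    """Row-average post-softmax attention weights inside chosen quadrants.

    attn:      (..., L, L) attention weights over a fused [video; text] sequence,
               where the first n_video positions along each axis are video tokens.
    quadrants: (query-modality, key-modality) blocks whose fine-grained
               interactions should be collapsed to an average.
    """
    attn = attn.clone()
    video = slice(0, n_video)
    text = slice(n_video, attn.shape[-1])
    blocks = {
        "video-video": (video, video),
        "video-text": (video, text),
        "text-video": (text, video),
        "text-text": (text, text),
    }
    for name in quadrants:
        rows, cols = blocks[name]
        block = attn[..., rows, cols]
        # Each query token now receives only the mean of that modality's keys:
        # the per-quadrant attention mass is preserved, but the token-level
        # detail (and hence fine-grained fusion with that modality) is removed.
        attn[..., rows, cols] = block.mean(dim=-1, keepdim=True).expand_as(block)
    return attn

# Example: impair cross-modal fusion in a random attention map
weights = torch.softmax(torch.randn(2, 8, 20, 20), dim=-1)  # (batch, heads, L, L)
impaired = quag_average(weights, n_video=12)
```

Because the averaging preserves each row's attention mass per quadrant, such a probe degrades only fine-grained token-level fusion; applying it at inference time without finetuning and observing little accuracy drop is what would indicate that a model is not relying on that fusion.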

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-rawal24a,
  title     = {Dissecting Multimodality in {V}ideo{QA} Transformer Models by Impairing Modality Fusion},
  author    = {Rawal, Ishaan Singh and Matyasko, Alexander and Jaiswal, Shantanu and Fernando, Basura and Tan, Cheston},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {42213--42244},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/rawal24a/rawal24a.pdf},
  url       = {https://proceedings.mlr.press/v235/rawal24a.html},
  abstract  = {While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Or are they achieving high scores by exploiting biases and spurious features? Hence, to provide insights, we design QUAG (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design QUAG-attention, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models’ abilities to learn highly-coupled multimodal representations. Hence, we design the CLAVI (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models for learning highly-coupled multimodal representations, that is not evaluated by the current datasets.}
}
Endnote
%0 Conference Paper
%T Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
%A Ishaan Singh Rawal
%A Alexander Matyasko
%A Shantanu Jaiswal
%A Basura Fernando
%A Cheston Tan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-rawal24a
%I PMLR
%P 42213--42244
%U https://proceedings.mlr.press/v235/rawal24a.html
%V 235
%X While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Or are they achieving high scores by exploiting biases and spurious features? Hence, to provide insights, we design QUAG (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design QUAG-attention, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models’ abilities to learn highly-coupled multimodal representations. Hence, we design the CLAVI (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models for learning highly-coupled multimodal representations, that is not evaluated by the current datasets.
APA
Rawal, I.S., Matyasko, A., Jaiswal, S., Fernando, B. & Tan, C. (2024). Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:42213-42244. Available from https://proceedings.mlr.press/v235/rawal24a.html.