Robustness in Multimodal Learning under Train-Test Modality Mismatch

Brandon Mckinzie, Vaishaal Shankar, Joseph Yitan Cheng, Yinfei Yang, Jonathon Shlens, Alexander T Toshev
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:24291-24303, 2023.

Abstract

Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve a competitive result of $44.2$ mAP on AudioSet 20K.
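To make the train-test modality mismatch setting concrete, the following is a minimal sketch, not the paper's method: a toy late-fusion classifier whose per-modality encoders are averaged, evaluated once with both modalities (the training condition) and once with audio missing (the deployment condition). The modality names, dimensions, and mean-fusion scheme are illustrative assumptions only.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy two-modality classifier: per-modality encoders, mean-fused features."""
    def __init__(self, audio_dim=128, video_dim=256, hidden=64, num_classes=10):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, audio=None, video=None):
        # Encode whichever modalities are present and average their embeddings,
        # so the model can still produce predictions when one input is missing.
        feats = []
        if audio is not None:
            feats.append(self.audio_enc(audio))
        if video is not None:
            feats.append(self.video_enc(video))
        fused = torch.stack(feats).mean(dim=0)
        return self.head(fused)

model = LateFusionClassifier()
audio = torch.randn(4, 128)
video = torch.randn(4, 256)

logits_full = model(audio=audio, video=video)  # train-time condition: both modalities
logits_mismatch = model(video=video)           # test-time condition: audio missing

Comparing accuracy between the two conditions quantifies the robustness gap this paper studies; its proposed interventions aim to shrink that gap.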

Cite this Paper

BibTeX
@InProceedings{pmlr-v202-mckinzie23a,
  title     = {Robustness in Multimodal Learning under Train-Test Modality Mismatch},
  author    = {Mckinzie, Brandon and Shankar, Vaishaal and Cheng, Joseph Yitan and Yang, Yinfei and Shlens, Jonathon and Toshev, Alexander T},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {24291--24303},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/mckinzie23a/mckinzie23a.pdf},
  url       = {https://proceedings.mlr.press/v202/mckinzie23a.html},
  abstract  = {Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve a competitive result of $44.2$ mAP on AudioSet 20K.}
}
Endnote
%0 Conference Paper
%T Robustness in Multimodal Learning under Train-Test Modality Mismatch
%A Brandon Mckinzie
%A Vaishaal Shankar
%A Joseph Yitan Cheng
%A Yinfei Yang
%A Jonathon Shlens
%A Alexander T Toshev
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-mckinzie23a
%I PMLR
%P 24291--24303
%U https://proceedings.mlr.press/v202/mckinzie23a.html
%V 202
%X Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve a competitive result of $44.2$ mAP on AudioSet 20K.
APA
Mckinzie, B., Shankar, V., Cheng, J.Y., Yang, Y., Shlens, J. & Toshev, A.T. (2023). Robustness in Multimodal Learning under Train-Test Modality Mismatch. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:24291-24303. Available from https://proceedings.mlr.press/v202/mckinzie23a.html.