A Closer Look at Multimodal Representation Collapse

Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:7555-7577, 2025.

Abstract

We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
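The abstract above describes collapse as arising when a shared set of fusion-head neurons entangles noisy features of one modality with predictive features of another. As a purely illustrative, hypothetical sketch (not code from the paper or its project page), the snippet below builds such a shared fusion head over two concatenated modality embeddings and probes reliance on each modality by zeroing it out; all module and function names here are assumptions made for illustration only.

# Hypothetical sketch: a minimal two-modality fusion head and a crude
# ablation-based probe for modality collapse. Not the paper's algorithm.
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, dim_a=32, dim_b=32, hidden=64, num_classes=10):
        super().__init__()
        # A shared set of neurons processes the concatenated modality
        # features; the abstract argues such sharing can entangle noisy
        # features of one modality with predictive features of the other.
        self.fuse = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x_a, x_b):
        return self.fuse(torch.cat([x_a, x_b], dim=-1))

@torch.no_grad()
def modality_reliance(model, x_a, x_b):
    # Measure how much the logits change when each modality is ablated
    # (zeroed). A near-zero change for one modality suggests the model
    # has collapsed onto the other.
    full = model(x_a, x_b)
    drop_a = (full - model(torch.zeros_like(x_a), x_b)).norm() / full.norm()
    drop_b = (full - model(x_a, torch.zeros_like(x_b))).norm() / full.norm()
    return drop_a.item(), drop_b.item()

if __name__ == "__main__":
    model = LateFusionHead()
    x_a, x_b = torch.randn(8, 32), torch.randn(8, 32)
    print(modality_reliance(model, x_a, x_b))

In a trained fusion model exhibiting the collapse described in the abstract, one would expect one of the two ablation scores to be close to zero while the other remains large.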

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chaudhuri25a,
  title     = {A Closer Look at Multimodal Representation Collapse},
  author    = {Chaudhuri, Abhra and Dutta, Anjan and Bui, Tu and Georgescu, Serban},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {7555--7577},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chaudhuri25a/chaudhuri25a.pdf},
  url       = {https://proceedings.mlr.press/v267/chaudhuri25a.html},
  abstract  = {We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.}
}
Endnote
%0 Conference Paper
%T A Closer Look at Multimodal Representation Collapse
%A Abhra Chaudhuri
%A Anjan Dutta
%A Tu Bui
%A Serban Georgescu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chaudhuri25a
%I PMLR
%P 7555--7577
%U https://proceedings.mlr.press/v267/chaudhuri25a.html
%V 267
%X We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
APA
Chaudhuri, A., Dutta, A., Bui, T., & Georgescu, S. (2025). A Closer Look at Multimodal Representation Collapse. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:7555-7577. Available from https://proceedings.mlr.press/v267/chaudhuri25a.html.