Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization

Adrian Javaloy, Maryam Meghdadi, Isabel Valera
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9938-9964, 2022.

Abstract

A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimodal data, e.g., to jointly model images and their corresponding captions. Still, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., by fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs in the computational graphs where gradients conflict (impartiality blocks), as well as how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse, that is, to ensure impartial optimization across modalities. We apply our training framework to several multimodal VAE models, losses, and datasets from the literature, and empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
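The second ingredient of the framework, leveraging gradient-conflict solutions from multitask learning, can be made concrete with a small sketch. Below is a minimal PyTorch illustration (not the authors' released code) of one such solution, a PCGrad-style projection (Yu et al., 2020): whenever two modalities' gradients on the shared parameters point in conflicting directions, the conflicting component is projected away before the update. The names pcgrad_combine, loss_image, loss_caption, and shared_params are hypothetical.

import torch

def pcgrad_combine(per_modality_losses, shared_params):
    """Combine per-modality gradients on shared parameters, projecting
    out pairwise conflicting components (PCGrad-style)."""
    # Flatten each modality's gradient over the shared parameters.
    flat_grads = []
    for loss in per_modality_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        flat_grads.append(torch.cat([g.reshape(-1) for g in grads]))

    # For each modality, remove the component that conflicts with the
    # others. (The original method shuffles the order of the others;
    # a fixed order is used here for simplicity.)
    projected = []
    for i, g_i in enumerate(flat_grads):
        g = g_i.clone()
        for j, g_j in enumerate(flat_grads):
            if i == j:
                continue
            dot = torch.dot(g, g_j)
            if dot < 0:  # negative inner product: directions conflict
                g = g - (dot / g_j.norm() ** 2) * g_j
        projected.append(g)

    # Sum the de-conflicted gradients into one impartial update direction.
    return torch.stack(projected).sum(dim=0)

# Hypothetical usage: write the combined gradient back into .grad fields.
# combined = pcgrad_combine([loss_image, loss_caption], shared_params)
# offset = 0
# for p in shared_params:
#     n = p.numel()
#     p.grad = combined[offset:offset + n].view_as(p)
#     offset += n
# optimizer.step()

In a multimodal VAE, per_modality_losses would be the per-modality terms of the objective and shared_params the parameters inside an impartiality block; the framework described in the abstract is agnostic to which multitask conflict-resolution method fills this role.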

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-javaloy22a,
  title     = {Mitigating Modality Collapse in Multimodal {VAE}s via Impartial Optimization},
  author    = {Javaloy, Adrian and Meghdadi, Maryam and Valera, Isabel},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {9938--9964},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/javaloy22a/javaloy22a.pdf},
  url       = {https://proceedings.mlr.press/v162/javaloy22a.html}
}
Endnote
%0 Conference Paper
%T Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization
%A Adrian Javaloy
%A Maryam Meghdadi
%A Isabel Valera
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-javaloy22a
%I PMLR
%P 9938--9964
%U https://proceedings.mlr.press/v162/javaloy22a.html
%V 162
APA
Javaloy, A., Meghdadi, M., & Valera, I. (2022). Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:9938-9964. Available from https://proceedings.mlr.press/v162/javaloy22a.html.