$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:68870-68888, 2025.
Abstract
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, existing approaches are limited by $\textbf{(a)}$ their focus on modality correspondences, which neglects heterogeneous interactions between modalities, and $\textbf{(b)}$ the fact that they output a single multimodal prediction without offering interpretable insights into the multimodal interactions present in the data. In this work, we propose $\texttt{I$^2$MoE}$ ($\underline{I}$nterpretable Multimodal $\underline{I}$nteraction-aware $\underline{M}$ixture-$\underline{o}$f-$\underline{E}$xperts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation at the local and global levels. First, $\texttt{I$^2$MoE}$ utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, $\texttt{I$^2$MoE}$ deploys a reweighting model that assigns importance scores to the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation on medical and general multimodal datasets shows that $\texttt{I$^2$MoE}$ is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
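To make the two components named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: the class name `I2MoESketch`, the layer sizes, the number of experts, and the softmax reweighter are illustrative assumptions; the actual architecture and the weakly supervised interaction losses are in the linked repository. The sketch shows a set of interaction experts operating on a fused representation and a reweighting model whose per-sample weights both combine the expert outputs and serve as sample-level interpretation.

```python
import torch
import torch.nn as nn


class I2MoESketch(nn.Module):
    """Hypothetical interaction-aware MoE fusion head (illustrative only)."""

    def __init__(self, fused_dim: int, num_experts: int, num_classes: int):
        super().__init__()
        # One "interaction expert" per interaction type; each maps the fused
        # features to a task prediction.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(fused_dim, fused_dim), nn.ReLU(),
                          nn.Linear(fused_dim, num_classes))
            for _ in range(num_experts)
        ])
        # Reweighting model: assigns a per-sample importance score to each
        # expert's output.
        self.reweighter = nn.Sequential(
            nn.Linear(fused_dim, num_experts), nn.Softmax(dim=-1)
        )

    def forward(self, fused: torch.Tensor):
        # Stack expert predictions: (batch, num_experts, num_classes).
        expert_logits = torch.stack([e(fused) for e in self.experts], dim=1)
        # Per-sample importance scores: (batch, num_experts).
        weights = self.reweighter(fused)
        # Weighted combination of expert outputs; the weights double as
        # sample-level interpretation of interaction importance.
        combined = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)
        return combined, weights


# Example: 3 interaction experts over a 128-d fused representation, 10 classes.
model = I2MoESketch(fused_dim=128, num_experts=3, num_classes=10)
logits, importance = model(torch.randn(4, 128))
print(logits.shape, importance.shape)  # torch.Size([4, 10]) torch.Size([4, 3])
```

Averaging the per-sample importance scores over a dataset gives the kind of dataset-level view of interaction importance the abstract refers to.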