$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:68870-68888, 2025.
Abstract
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, existing approaches are limited by $\textbf{(a)}$ their focus on modality correspondences, which neglects heterogeneous interactions between modalities, and $\textbf{(b)}$ the fact that they output a single multimodal prediction without offering interpretable insights into the multimodal interactions present in the data. In this work, we propose $\texttt{I$^2$MoE}$ ($\underline{I}$nterpretable Multimodal $\underline{I}$nteraction-aware $\underline{M}$ixture-$\underline{o}$f-$\underline{E}$xperts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation at the local and global levels. First, $\texttt{I$^2$MoE}$ utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, $\texttt{I$^2$MoE}$ deploys a reweighting model that assigns importance scores to the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation on medical and general multimodal datasets shows that $\texttt{I$^2$MoE}$ is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
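To make the two components named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: the class name `I2MoESketch`, the layer sizes, the number of experts, and the softmax reweighter are illustrative assumptions; the actual architecture and the weakly supervised interaction losses are in the linked repository. The sketch shows a set of interaction experts operating on a fused representation and a reweighting model whose per-sample weights both combine the expert outputs and serve as sample-level interpretation.

```python
import torch
import torch.nn as nn


class I2MoESketch(nn.Module):
    """Hypothetical interaction-aware MoE fusion head (illustrative only)."""

    def __init__(self, fused_dim: int, num_experts: int, num_classes: int):
        super().__init__()
        # One "interaction expert" per interaction type; each maps the fused
        # features to a task prediction.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(fused_dim, fused_dim), nn.ReLU(),
                          nn.Linear(fused_dim, num_classes))
            for _ in range(num_experts)
        ])
        # Reweighting model: assigns a per-sample importance score to each
        # expert's output.
        self.reweighter = nn.Sequential(
            nn.Linear(fused_dim, num_experts), nn.Softmax(dim=-1)
        )

    def forward(self, fused: torch.Tensor):
        # Stack expert predictions: (batch, num_experts, num_classes).
        expert_logits = torch.stack([e(fused) for e in self.experts], dim=1)
        # Per-sample importance scores: (batch, num_experts).
        weights = self.reweighter(fused)
        # Weighted combination of expert outputs; the weights double as
        # sample-level interpretation of interaction importance.
        combined = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)
        return combined, weights


# Example: 3 interaction experts over a 128-d fused representation, 10 classes.
model = I2MoESketch(fused_dim=128, num_experts=3, num_classes=10)
logits, importance = model(torch.randn(4, 128))
print(logits.shape, importance.shape)  # torch.Size([4, 10]) torch.Size([4, 3])
```

Averaging the per-sample importance scores over a dataset gives the kind of dataset-level view of interaction importance the abstract refers to.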