Sparse MoE as a New Treatment: Addressing Forgetting, Fitting, Learning Issues in Multi-Modal Multi-Task Learning
Conference on Parsimony and Learning, PMLR 280:1112-1145, 2025.
Abstract
Sparse Mixture-of-Experts (SMoE) is a promising paradigm that can be easily tailored for multi-task learning. Its conditional computation allows us to organically allocate relevant parts of a model for performant and efficient predictions. However, several under-explored pain points persist, especially in scenarios with both multiple modalities and multiple tasks: 1. $\textit{Modality Forgetting Issue.}$ Diverse modalities may prefer conflicting optimization directions, resulting in ineffective learning or knowledge forgetting; 2. $\textit{Modality Fitting Issue.}$ Current SMoE pipelines select a fixed number of experts for all modalities, which can end up over-fitting simpler modalities or under-fitting more complex ones; 3. $\textit{Heterogeneous Learning Pace.}$ Varied modality attributes, task resources, and objectives usually lead to distinct optimization difficulties and convergence behaviors. Given these issues, there is a clear need for a systematic approach to harmonizing multi-modal and multi-task objectives when using SMoE. We aim to address these pain points and propose a new $\underline{S}$parse $\underline{M}$oE for $\underline{M}$ulti-$\underline{M}$odal $\underline{M}$ulti-task learning, $\textit{a.k.a.}$, $\texttt{SM$^4$}$, which ($1$) disentangles the model spaces of different modalities to mitigate their optimization conflicts; ($2$) automatically determines a modality-specific model size to improve fitting; and ($3$) synchronizes the learning paces of disparate modalities and tasks based on SMoE training dynamics such as the entropy of routing decisions. Comprehensive experiments validate the effectiveness of $\texttt{SM$^4$}$, which outperforms previous state-of-the-art methods across $3$ task groups and $11$ different modalities with a clear performance margin ($\textit{e.g.}$, $\ge 1.37\%$) and a substantial computation reduction ($46.49\% \sim 98.62\%$). Code is provided in the supplement.
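To make the two SMoE quantities mentioned above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of per-modality top-k expert routing and the entropy of the routing distribution, the training signal the abstract refers to. The expert count, feature dimension, modality names, and per-modality top-k values are assumptions chosen only for illustration.

```python
# Illustrative sketch of sparse MoE routing and routing entropy (assumed setup,
# not the SM^4 code release). Uses standard PyTorch calls only.
import torch
import torch.nn.functional as F

def route(tokens: torch.Tensor, gate: torch.nn.Linear, top_k: int):
    """Return a top-k expert selection and the mean routing entropy
    for a batch of tokens from one modality."""
    logits = gate(tokens)                             # (batch, num_experts)
    probs = F.softmax(logits, dim=-1)                 # routing distribution per token
    topk_vals, topk_idx = probs.topk(top_k, dim=-1)   # sparse expert selection
    # Entropy of each token's routing distribution; lower entropy means
    # more confident (more specialized) routing decisions.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return topk_idx, topk_vals, entropy

# Hypothetical example: a shared pool of 8 experts, with a smaller top-k for a
# "simpler" modality and a larger top-k for a "more complex" one.
gate = torch.nn.Linear(64, 8)
audio_tokens = torch.randn(32, 64)
video_tokens = torch.randn(32, 64)
_, _, audio_entropy = route(audio_tokens, gate, top_k=1)
_, _, video_entropy = route(video_tokens, gate, top_k=2)
```

In this sketch, varying `top_k` per modality stands in for modality-specific model sizing, and the per-modality routing entropies are the kind of training-dynamics signal that could be monitored to compare how quickly different modalities are specializing.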