Sparse Mixture-of-Experts for Compositional Generalization: Empirical Evidence and Theoretical Foundations of Optimal Sparsity

Jinze Zhao, Peihao Wang, Junjie Yang, Ruisi Cai, Gaowen Liu, Jayanth Srinivasa, Ramana Rao Kompella, Yingbin Liang, Zhangyang Wang
Conference on Parsimony and Learning, PMLR 328:1049-1071, 2026.

Abstract

Sparse Mixture-of-Experts (SMoE) architectures have gained prominence for their ability to scale neural networks, particularly transformers, without a proportional increase in computational cost. Despite their success, their role in compositional generalization, i.e., adapting to novel combinations of known components, remains under-explored. This study challenges the assumption that minimal expert activation suffices for task generalization and investigates the relationship between task complexity and optimal sparsity in SMoE models. Through empirical evaluations on the SRAVEN symbolic reasoning task and the SKILL-MIX benchmark, we demonstrate that (i) the number of activated experts consistently increases with the perceived task difficulty to maintain performance; and (ii) the optimal number of activated experts scales proportionally with task complexity. Our theoretical analysis derives a scaling law for optimal sparsity by balancing approximation and estimation errors, revealing alignment with empirical observations. We formally show that the optimal sparsity lies between minimal activation (1-2 experts) and full activation, with the exact number scaling proportionally to task complexity and further influenced by the size of the training data and the complexity of the model. These findings offer practical insights for designing SMoE models that achieve computational efficiency while enabling robust compositional generalization.
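For readers unfamiliar with the architecture, the sparsity discussed above is the number k of experts that a learned gate activates per input in a top-k-routed SMoE layer. The following NumPy sketch is a minimal, hypothetical illustration of such routing, not the paper's implementation; all names, shapes, and the single-vector setting are assumptions made for clarity.

```python
import numpy as np

def smoe_layer(x, gate_W, expert_Ws, k):
    """Route x through the top-k experts chosen by a linear gate.

    x         : (d,) input vector
    gate_W    : (E, d) gating weights, one row per expert
    expert_Ws : list of E (d, d) expert weight matrices
    k         : number of activated experts (the sparsity level studied here)
    """
    logits = gate_W @ x                              # one routing score per expert
    top_k = np.argsort(logits)[-k:]                  # indices of the k highest-scoring experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                         # softmax renormalized over the selected experts only
    # Only the k selected experts are evaluated; the remaining E - k are skipped entirely.
    return sum(w * (expert_Ws[e] @ x) for w, e in zip(weights, top_k))

# Toy usage: 8 experts of width 16, activating k = 2 of them.
rng = np.random.default_rng(0)
d, E, k = 16, 8, 2
x = rng.normal(size=d)
out = smoe_layer(x, rng.normal(size=(E, d)),
                 [rng.normal(size=(d, d)) for _ in range(E)], k)
```

Varying k in such a layer is the knob the abstract refers to: activating too few experts limits how well compositions of skills can be approximated, while activating all of them forfeits the computational savings and, per the analysis summarized above, is not optimal either.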

Cite this Paper


BibTeX
@InProceedings{pmlr-v328-zhao26b,
  title     = {Sparse Mixture-of-Experts for Compositional Generalization: Empirical Evidence and Theoretical Foundations of Optimal Sparsity},
  author    = {Zhao, Jinze and Wang, Peihao and Yang, Junjie and Cai, Ruisi and Liu, Gaowen and Srinivasa, Jayanth and Kompella, Ramana Rao and Liang, Yingbin and Wang, Zhangyang},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {1049--1071},
  year      = {2026},
  editor    = {Burkholz, Rebekka and Liu, Shiwei and Ravishankar, Saiprasad and Redman, William and Huang, Wei and Su, Weijie and Zhu, Zhihui},
  volume    = {328},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--26 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v328/main/assets/zhao26b/zhao26b.pdf},
  url       = {https://proceedings.mlr.press/v328/zhao26b.html},
  abstract  = {Sparse Mixture-of-Experts (SMoE) architectures have gained prominence for their ability to scale neural networks, particularly transformers, without a proportional increase in computational cost. Despite their success, their role in compositional generalization, i.e., adapting to novel combinations of known components, remains under-explored. This study challenges the assumption that minimal expert activation suffices for task generalization and investigates the relationship between task complexity and optimal sparsity in SMoE models. Through empirical evaluations on the SRAVEN symbolic reasoning task and the SKILL-MIX benchmark, we demonstrate that (i) the number of activated experts consistently increases with the perceived task difficulty to maintain performance; and (ii) the optimal number of activated experts scales proportionally with task complexity. Our theoretical analysis derives a scaling law for optimal sparsity by balancing approximation and estimation errors, revealing alignment with empirical observations. We formally show that the optimal sparsity lies between minimal activation (1-2 experts) and full activation, with the exact number scaling proportionally to task complexity and further influenced by the size of the training data and the complexity of the model. These findings offer practical insights for designing SMoE models that achieve computational efficiency while enabling robust compositional generalization.}
}
Endnote
%0 Conference Paper
%T Sparse Mixture-of-Experts for Compositional Generalization: Empirical Evidence and Theoretical Foundations of Optimal Sparsity
%A Jinze Zhao
%A Peihao Wang
%A Junjie Yang
%A Ruisi Cai
%A Gaowen Liu
%A Jayanth Srinivasa
%A Ramana Rao Kompella
%A Yingbin Liang
%A Zhangyang Wang
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Rebekka Burkholz
%E Shiwei Liu
%E Saiprasad Ravishankar
%E William Redman
%E Wei Huang
%E Weijie Su
%E Zhihui Zhu
%F pmlr-v328-zhao26b
%I PMLR
%P 1049--1071
%U https://proceedings.mlr.press/v328/zhao26b.html
%V 328
%X Sparse Mixture-of-Experts (SMoE) architectures have gained prominence for their ability to scale neural networks, particularly transformers, without a proportional increase in computational cost. Despite their success, their role in compositional generalization, i.e., adapting to novel combinations of known components, remains under-explored. This study challenges the assumption that minimal expert activation suffices for task generalization and investigates the relationship between task complexity and optimal sparsity in SMoE models. Through empirical evaluations on the SRAVEN symbolic reasoning task and the SKILL-MIX benchmark, we demonstrate that (i) the number of activated experts consistently increases with the perceived task difficulty to maintain performance; and (ii) the optimal number of activated experts scales proportionally with task complexity. Our theoretical analysis derives a scaling law for optimal sparsity by balancing approximation and estimation errors, revealing alignment with empirical observations. We formally show that the optimal sparsity lies between minimal activation (1-2 experts) and full activation, with the exact number scaling proportionally to task complexity and further influenced by the size of the training data and the complexity of the model. These findings offer practical insights for designing SMoE models that achieve computational efficiency while enabling robust compositional generalization.
APA
Zhao, J., Wang, P., Yang, J., Cai, R., Liu, G., Srinivasa, J., Kompella, R.R., Liang, Y. & Wang, Z. (2026). Sparse Mixture-of-Experts for Compositional Generalization: Empirical Evidence and Theoretical Foundations of Optimal Sparsity. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 328:1049-1071. Available from https://proceedings.mlr.press/v328/zhao26b.html.
