Domain-Specific Expert Pruning for Mixture-of-Experts LLMs

Juntao Yao, Huiyuan Zheng, Boyang Wang, Xiaohu Yu, Yibo Li, Shaosheng Cao, Donglin Di, Boyan Wang, Haoyun Zheng, Jinze Yu, Anjie Le, Hongcheng Guo
Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:313-321, 2026.

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: (1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and (2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Domain-Specific Expert Pruning (C-PRUNE), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-PRUNE operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics, followed by global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-PRUNE through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-PRUNE effectively reduces model size while outperforming existing MoE pruning methods. The effectiveness is observed across diverse domains, with notable performance in the medical field. Our code is available at https://github.com/Fighoture/MoE_unsupervised_pruning.
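The two-stage procedure the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it assumes cosine similarity over flattened expert weights as the "parameter similarity metric", a greedy threshold-based grouping for the layer-wise clustering stage, and a hypothetical per-expert importance score that is summed per cluster for the global pruning stage.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_experts(experts, tau=0.9):
    """Stage 1 (layer-wise clustering): greedily assign each expert to
    the first cluster whose representative it resembles (cosine >= tau),
    otherwise start a new cluster. Returns lists of expert indices."""
    clusters = []
    for i, w in enumerate(experts):
        for c in clusters:
            if cosine(w, experts[c[0]]) >= tau:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def prune_clusters(clusters, scores, keep_ratio=0.5):
    """Stage 2 (cluster pruning): rank clusters by a summed importance
    score (hypothetical here) and keep only the top fraction."""
    ranked = sorted(clusters, key=lambda c: sum(scores[i] for i in c),
                    reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    return sorted(i for c in kept for i in c)

# Toy layer with four experts: 0/1 are near-duplicates, as are 2/3.
experts = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.98]]
scores = [0.6, 0.5, 0.1, 0.05]  # hypothetical per-expert importance
clusters = cluster_experts(experts, tau=0.95)   # → [[0, 1], [2, 3]]
kept = prune_clusters(clusters, scores, 0.5)    # → [0, 1]
```

In the paper's setting the importance scores additionally account for cross-layer homogeneity and pruning is performed globally across all MoE layers; the sketch above covers a single layer only.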

Cite this Paper


BibTeX
@InProceedings{pmlr-v317-yao26a,
  title     = {Domain-Specific Expert Pruning for Mixture-of-Experts LLMs},
  author    = {Yao, Juntao and Zheng, Huiyuan and Wang, Boyang and Yu, Xiaohu and Li, Yibo and Cao, Shaosheng and Di, Donglin and Wang, Boyan and Zheng, Haoyun and Yu, Jinze and Le, Anjie and Guo, Hongcheng},
  booktitle = {Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare},
  pages     = {313--321},
  year      = {2026},
  editor    = {Wu, Junde and Pan, Jiazhen and Zhu, Jiayuan and Luo, Luyang and Li, Yitong and Xu, Min and Jin, Yueming and Rueckert, Daniel},
  volume    = {317},
  series    = {Proceedings of Machine Learning Research},
  month     = {20--21 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v317/main/assets/yao26a/yao26a.pdf},
  url       = {https://proceedings.mlr.press/v317/yao26a.html},
  abstract  = {Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: (1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and (2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Domain-Specific Expert Pruning (C-PRUNE), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-PRUNE operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics, followed by global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-PRUNE through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-PRUNE effectively reduces model size while outperforming existing MoE pruning methods. The effectiveness is observed across diverse domains, with notable performance in the medical field. Our code is available at https://github.com/Fighoture/MoE_unsupervised_pruning.}
}
Endnote
%0 Conference Paper %T Domain-Specific Expert Pruning for Mixture-of-Experts LLMs %A Juntao Yao %A Huiyuan Zheng %A Boyang Wang %A Xiaohu Yu %A Yibo Li %A Shaosheng Cao %A Donglin Di %A Boyan Wang %A Haoyun Zheng %A Jinze Yu %A Anjie Le %A Hongcheng Guo %B Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare %C Proceedings of Machine Learning Research %D 2026 %E Junde Wu %E Jiazhen Pan %E Jiayuan Zhu %E Luyang Luo %E Yitong Li %E Min Xu %E Yueming Jin %E Daniel Rueckert %F pmlr-v317-yao26a %I PMLR %P 313--321 %U https://proceedings.mlr.press/v317/yao26a.html %V 317 %X Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: (1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and (2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Domain-Specific Expert Pruning (C-PRUNE), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-PRUNE operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics, followed by global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-PRUNE through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-PRUNE effectively reduces model size while outperforming existing MoE pruning methods. The effectiveness is observed across diverse domains, with notable performance in the medical field. Our code is available at https://github.com/Fighoture/MoE_unsupervised_pruning.
APA
Yao, J., Zheng, H., Wang, B., Yu, X., Li, Y., Cao, S., Di, D., Wang, B., Zheng, H., Yu, J., Le, A. & Guo, H. (2026). Domain-Specific Expert Pruning for Mixture-of-Experts LLMs. Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, in Proceedings of Machine Learning Research 317:313-321. Available from https://proceedings.mlr.press/v317/yao26a.html.