Retraining-free Merging of Sparse MoE via Hierarchical Clustering

I-Chun Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8594-8620, 2025.

Abstract

Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts in large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments. Our implementation is available at https://github.com/wazenmai/HC-SMoE.
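To make the idea of output-based expert merging concrete, below is a minimal sketch of hierarchical clustering over one SMoE layer's experts. It is illustrative only: the function names (cluster_experts, merge_expert_weights), the array shapes, the choice of Euclidean distance with average linkage, and the uniform weight-averaging rule are assumptions for this sketch, not the authors' exact configuration; see the linked repository for the actual implementation.

# Minimal sketch: cluster experts by their outputs, then merge within clusters.
# Names, shapes, and the averaging rule are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_experts(expert_outputs: np.ndarray, num_clusters: int) -> np.ndarray:
    """Group experts by the similarity of their outputs on calibration data.

    expert_outputs: (num_experts, hidden_dim) array, e.g. each expert's output
    averaged over a small set of calibration tokens.
    Returns cluster labels in {1, ..., num_clusters}.
    """
    # Agglomerative (hierarchical) clustering with average linkage on the
    # per-expert output vectors; scipy computes pairwise distances internally.
    Z = linkage(expert_outputs, method="average", metric="euclidean")
    return fcluster(Z, t=num_clusters, criterion="maxclust")

def merge_expert_weights(expert_weights: list[np.ndarray],
                         labels: np.ndarray) -> list[np.ndarray]:
    """Merge experts in the same cluster by uniformly averaging their weights.

    Each entry of expert_weights stands in for one expert's parameter tensor;
    a real expert has several tensors, each merged analogously.
    """
    merged = []
    for c in sorted(set(labels)):
        members = [w for w, lbl in zip(expert_weights, labels) if lbl == c]
        merged.append(np.mean(members, axis=0))
    return merged

In practice, one would gather expert_outputs by passing a small calibration set through the layer and recording each expert's output activations, then replace the original experts with the merged ones and remap the router to the surviving clusters, which is what allows the reduction to proceed without any retraining.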

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25aq,
  title     = {Retraining-free Merging of Sparse {M}o{E} via Hierarchical Clustering},
  author    = {Chen, I-Chun and Liu, Hsu-Shen and Sun, Wei-Fang and Chao, Chen-Hao and Hsu, Yen-Chang and Lee, Chun-Yi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {8594--8620},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25aq/chen25aq.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25aq.html},
  abstract  = {Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts in large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments. Our implementation is available at https://github.com/wazenmai/HC-SMoE.}
}
Endnote
%0 Conference Paper
%T Retraining-free Merging of Sparse MoE via Hierarchical Clustering
%A I-Chun Chen
%A Hsu-Shen Liu
%A Wei-Fang Sun
%A Chen-Hao Chao
%A Yen-Chang Hsu
%A Chun-Yi Lee
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25aq
%I PMLR
%P 8594--8620
%U https://proceedings.mlr.press/v267/chen25aq.html
%V 267
%X Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts in large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments. Our implementation is available at https://github.com/wazenmai/HC-SMoE.
APA
Chen, I., Liu, H., Sun, W., Chao, C., Hsu, Y. & Lee, C. (2025). Retraining-free Merging of Sparse MoE via Hierarchical Clustering. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8594-8620. Available from https://proceedings.mlr.press/v267/chen25aq.html.