A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

Mohammed Nowaz Rabbani Chowdhury, Meng Wang, Kaoutar El Maghraoui, Naigang Wang, Pin-Yu Chen, Christopher Carothers
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8815-8847, 2024.

Abstract

The sparsely gated mixture of experts (MoE) architecture sends different inputs to different subnetworks (experts) through trainable routers. MoE reduces the training computation significantly for large models, but its deployment can still be memory- and computation-expensive for some downstream tasks. Model pruning is a popular approach to reduce inference computation, but its application to the MoE architecture is largely unexplored. To the best of our knowledge, this paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of the experts with a smaller change of the router’s $l_2$ norm from the pre-trained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements. Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models such as V-MoE and $\text{E}^3$-MoE fine-tuned on benchmark datasets such as CIFAR-10, CIFAR-100, and ImageNet.
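
The criterion described above can be read as ranking experts by how much their router weights moved during fine-tuning and pruning those that moved the least. The sketch below is an illustration only, not the authors' released code: it assumes each MoE layer exposes its router as a (num_experts, hidden_dim) weight matrix, scores each expert by the $l_2$ norm of the difference between its fine-tuned and pre-trained router row, and keeps the experts with the largest scores. The function names (expert_scores, experts_to_keep) and tensor shapes are assumptions made for this example.

# A minimal sketch of the expert-pruning criterion described in the abstract.
# Assumptions (not from the paper's released code): each MoE layer exposes its
# router as a weight matrix of shape (num_experts, hidden_dim), and experts
# whose router rows changed the least during fine-tuning are pruned first.
import torch

def expert_scores(router_pretrained: torch.Tensor,
                  router_finetuned: torch.Tensor) -> torch.Tensor:
    # l2 norm of the change in each expert's router weights:
    # score for expert e is || W_ft[e] - W_pt[e] ||_2.
    return (router_finetuned - router_pretrained).norm(p=2, dim=1)

def experts_to_keep(router_pretrained: torch.Tensor,
                    router_finetuned: torch.Tensor,
                    num_keep: int) -> torch.Tensor:
    # Indices of the experts to keep: those whose router rows changed the most,
    # so the smallest-change experts are the ones pruned.
    scores = expert_scores(router_pretrained, router_finetuned)
    return torch.topk(scores, k=num_keep).indices

# Toy usage with random weights standing in for real checkpoints.
if __name__ == "__main__":
    num_experts, hidden_dim = 8, 16
    w_pt = torch.randn(num_experts, hidden_dim)              # pre-trained router
    w_ft = w_pt + 0.1 * torch.randn(num_experts, hidden_dim) # fine-tuned router
    keep = experts_to_keep(w_pt, w_ft, num_keep=4)
    print("experts kept:", sorted(keep.tolist()))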

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-chowdhury24a,
  title     = {A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts},
  author    = {Chowdhury, Mohammed Nowaz Rabbani and Wang, Meng and El Maghraoui, Kaoutar and Wang, Naigang and Chen, Pin-Yu and Carothers, Christopher},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {8815--8847},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/chowdhury24a/chowdhury24a.pdf},
  url       = {https://proceedings.mlr.press/v235/chowdhury24a.html},
  abstract  = {The sparsely gated mixture of experts (MoE) architecture sends different inputs to different subnetworks (experts), through trainable routers. MoE reduces the training computation significantly for large models, but its deployment can be still memory/computation expensive for some downstream tasks. Model pruning is a popular approach to reduce inference computation, but its application in MoE architecture is largely unexplored. To the best of our knowledge, this paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of the experts with a smaller change of the router’s $l_2$ norm from the pre-trained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements. Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models such as V-MoE and $\text{E}^3$-MoE fine-tuned on benchmark datasets such as CIFAR-10, CIFAR-100, and ImageNet.}
}
Endnote
%0 Conference Paper
%T A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts
%A Mohammed Nowaz Rabbani Chowdhury
%A Meng Wang
%A Kaoutar El Maghraoui
%A Naigang Wang
%A Pin-Yu Chen
%A Christopher Carothers
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-chowdhury24a
%I PMLR
%P 8815--8847
%U https://proceedings.mlr.press/v235/chowdhury24a.html
%V 235
%X The sparsely gated mixture of experts (MoE) architecture sends different inputs to different subnetworks (experts), through trainable routers. MoE reduces the training computation significantly for large models, but its deployment can be still memory/computation expensive for some downstream tasks. Model pruning is a popular approach to reduce inference computation, but its application in MoE architecture is largely unexplored. To the best of our knowledge, this paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of the experts with a smaller change of the router’s $l_2$ norm from the pre-trained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements. Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models such as V-MoE and $\text{E}^3$-MoE fine-tuned on benchmark datasets such as CIFAR-10, CIFAR-100, and ImageNet.
APA
Chowdhury, M.N.R., Wang, M., El Maghraoui, K., Wang, N., Chen, P.-Y. & Carothers, C. (2024). A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:8815-8847. Available from https://proceedings.mlr.press/v235/chowdhury24a.html.

Related Material