Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:182-193, 2024.
Abstract
We introduce StructMoE, a method to scale MoE architectures by augmenting experts with dynamic capacity using structured matrices we call Low Rank Experts (LoRE). These LoREs are selected on a per-expert and per-token basis using a secondary router specific to every expert, and are entangled with the main expert in the up-projection phase, before the activation function. Empirically, we find this approach to outperform an MoE baseline in terms of loss on a held-out validation set.
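The abstract describes each expert as a standard up-/down-projection whose up-projection output is mixed with a token-selected Low Rank Expert before the activation. The PyTorch sketch below is one illustrative reading of that description, not the paper's implementation: the class name StructExpert, the top-1 LoRE selection, the number of LoREs, and the rank are all assumptions, and the outer MoE router that assigns tokens to this expert in the first place is omitted.

# Illustrative sketch of one StructMoE-style expert (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructExpert(nn.Module):
    """One expert of an MoE layer, augmented with expert-specific Low Rank Experts (LoREs)."""
    def __init__(self, d_model: int, d_ff: int, num_lores: int = 4, lore_rank: int = 8):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)     # main expert up-projection
        self.down = nn.Linear(d_ff, d_model, bias=False)   # main expert down-projection
        self.lore_router = nn.Linear(d_model, num_lores)   # secondary router, specific to this expert
        # Each LoRE is a rank-`lore_rank` factorized map from d_model to d_ff.
        self.lore_a = nn.Parameter(0.02 * torch.randn(num_lores, d_model, lore_rank))
        self.lore_b = nn.Parameter(0.02 * torch.randn(num_lores, lore_rank, d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model), the tokens already routed to this expert.
        probs = F.softmax(self.lore_router(x), dim=-1)      # per-token LoRE scores
        weight, idx = probs.max(dim=-1)                     # assumed top-1 LoRE selection
        a, b = self.lore_a[idx], self.lore_b[idx]           # gather the chosen LoRE per token
        lore_out = torch.einsum("td,tdr,trf->tf", x, a, b)  # low-rank up-projection path
        # "Entangle" the LoRE output with the main up-projection before the activation.
        h = F.gelu(self.up(x) + weight.unsqueeze(-1) * lore_out)
        return self.down(h)

# Example usage: 16 tokens of width 512 through one expert.
# y = StructExpert(d_model=512, d_ff=2048)(torch.randn(16, 512))  # shape (16, 512)

In a full MoE layer, an instance of such an expert would sit behind the usual token-to-expert router; only the secondary, expert-specific LoRE routing is sketched here.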
Cite this Paper
BibTeX
@InProceedings{pmlr-v262-sarwar24a,
title = {{StructMoE}: Structured Mixture of Experts Using Low Rank Experts},
author = {Sarwar, Zain and Panda, Ashwinee and Th\'erien, Benjamin and Rawls, Stephen and Das, Anirban and Balasubramaniam, Kartik and Kapusuzoglu, Berkcan and Zhang, Shixiong and Sahu, Sambit and Naphade, Milind and Chakraborty, Supriyo},
booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
pages = {182--193},
year = {2024},
editor = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
volume = {262},
series = {Proceedings of Machine Learning Research},
month = {14 Dec},
publisher = {PMLR},
pdf = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/sarwar24a/sarwar24a.pdf},
url = {https://proceedings.mlr.press/v262/sarwar24a.html},
abstract = {We introduce StructMoE, a method to scale MoE architectures by augmenting experts with dynamic capacity using structured matrices we call Low Rank Experts (LoRE). These LoREs are selected on a per-expert and per-token basis using a secondary router specific to every expert and are entangled with the main expert in the up-projection phase of the expert before the activation function. Empirically, we find this approach to outperform an MoE baseline in terms of loss on a held out validation set.}
}
Endnote
%0 Conference Paper
%T StructMoE: Structured Mixture of Experts Using Low Rank Experts
%A Zain Sarwar
%A Ashwinee Panda
%A Benjamin Thérien
%A Stephen Rawls
%A Anirban Das
%A Kartik Balasubramaniam
%A Berkcan Kapusuzoglu
%A Shixiong Zhang
%A Sambit Sahu
%A Milind Naphade
%A Supriyo Chakraborty
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-sarwar24a
%I PMLR
%P 182--193
%U https://proceedings.mlr.press/v262/sarwar24a.html
%V 262
%X We introduce StructMoE, a method to scale MoE architectures by augmenting experts with dynamic capacity using structured matrices we call Low Rank Experts (LoRE). These LoREs are selected on a per-expert and per-token basis using a secondary router specific to every expert and are entangled with the main expert in the up-projection phase of the expert before the activation function. Empirically, we find this approach to outperform an MoE baseline in terms of loss on a held out validation set.
APA
Sarwar, Z., Panda, A., Thérien, B., Rawls, S., Das, A., Balasubramaniam, K., Kapusuzoglu, B., Zhang, S., Sahu, S., Naphade, M. & Chakraborty, S. (2024). StructMoE: Structured Mixture of Experts Using Low Rank Experts. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:182-193. Available from https://proceedings.mlr.press/v262/sarwar24a.html.