Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:81-101, 2024.
Abstract
Sparsely-gated Mixture-of-Experts (MoEs) such as Gemini have proven to be more efficient than dense Transformers because they can dynamically activate a subset of their overall parameters by routing tokens to selected “experts”, allowing practitioners to scale up model parameter counts without significantly increasing total compute. However, current MoE training approaches only update the router with a sparse gradient and suffer from issues such as load imbalance. We propose a new router that can receive a dense gradient update from a sparse forward pass. Our method adds minimal overhead, yet improves on standard Top-K routing in both performance and load balance.
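
To make the sparse-gradient issue concrete, the sketch below contrasts standard Top-K routing, where the router only receives gradients through the K experts selected for each token, with one illustrative way to give the router a dense gradient while keeping the forward pass sparse: substituting a cached, detached running estimate of each non-selected expert's output. This is a minimal sketch under stated assumptions; the class names (TopKMoE, DenseGradTopKMoE) and the EMA substitution rule are illustrative choices, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Standard sparsely-gated MoE layer: gradients reach the router only
    via the Top-K gate weights that multiply the selected experts' outputs."""

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                # chosen expert per token
            gate = topk_p[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only these gate entries appear in the loss, so the
                    # router gradient is sparse: non-selected experts
                    # contribute nothing to d(loss)/d(logits).
                    out[mask] = out[mask] + gate[mask] * expert(x[mask])
        return out


class DenseGradTopKMoE(TopKMoE):
    """Hedged sketch of a dense router gradient from a sparse forward pass:
    the gate probability of every non-selected expert multiplies a detached
    running estimate of that expert's output. The estimate is a cached
    constant, so no extra experts run, but every routing logit now enters
    the loss and the router receives a dense gradient."""

    def __init__(self, d_model, n_experts, k=2, ema_decay=0.99):
        super().__init__(d_model, n_experts, k)
        self.ema_decay = ema_decay
        # Per-expert running estimate of the average expert output
        # (an illustrative stand-in, not the paper's exact update rule).
        self.register_buffer("ema_expert_out", torch.zeros(n_experts, d_model))

    def forward(self, x):
        logits = self.router(x)
        probs = F.softmax(logits, dim=-1)
        _, topk_idx = probs.topk(self.k, dim=-1)
        selected = torch.zeros_like(probs, dtype=torch.bool)
        selected.scatter_(1, topk_idx, torch.ones_like(topk_idx, dtype=torch.bool))

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = selected[:, e]
            if mask.any():
                y = expert(x[mask])
                out[mask] = out[mask] + probs[mask, e].unsqueeze(-1) * y
                with torch.no_grad():              # refresh the cached estimate
                    self.ema_expert_out[e].lerp_(y.mean(0), 1 - self.ema_decay)
        # Dense term: gate probs of non-selected experts multiply the cached
        # (detached) estimates, sending gradient to every router logit.
        dense = (probs * (~selected)) @ self.ema_expert_out
        return out + dense

A quick smoke test of the shapes: `DenseGradTopKMoE(d_model=16, n_experts=8)(torch.randn(32, 16))` returns a `(32, 16)` tensor, and calling `.sum().backward()` on it populates `router.weight.grad` with nonzero entries for all experts, whereas the plain `TopKMoE` only receives gradient through the selected gates.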