Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:81-101, 2024.
Abstract
Sparsely-gated Mixture-of-Experts (MoEs) such as Gemini have proven to be more efficient than dense Transformers because they can dynamically activate a subset of their overall parameters by routing tokens to selected “experts”, allowing practitioners to scale up model parameter counts without significantly increasing total compute. However, current MoE training approaches only update the router with a sparse gradient and suffer from issues such as load imbalance. We propose a new router that can receive a dense gradient update from a sparse forward pass. Our method adds minimal overhead, yet improves on standard Top-K routing in both performance and load balance.
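
To make the sparse-gradient issue concrete, the sketch below contrasts standard Top-K routing, where the router only receives gradients through the K experts selected for each token, with one illustrative way to give the router a dense gradient while keeping the forward pass sparse: substituting a cached, detached running estimate of each non-selected expert's output. This is a minimal sketch under stated assumptions; the class names (TopKMoE, DenseGradTopKMoE) and the EMA substitution rule are illustrative choices, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Standard sparsely-gated MoE layer: gradients reach the router only
    via the Top-K gate weights that multiply the selected experts' outputs."""

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                # chosen expert per token
            gate = topk_p[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only these gate entries appear in the loss, so the
                    # router gradient is sparse: non-selected experts
                    # contribute nothing to d(loss)/d(logits).
                    out[mask] = out[mask] + gate[mask] * expert(x[mask])
        return out


class DenseGradTopKMoE(TopKMoE):
    """Hedged sketch of a dense router gradient from a sparse forward pass:
    the gate probability of every non-selected expert multiplies a detached
    running estimate of that expert's output. The estimate is a cached
    constant, so no extra experts run, but every routing logit now enters
    the loss and the router receives a dense gradient."""

    def __init__(self, d_model, n_experts, k=2, ema_decay=0.99):
        super().__init__(d_model, n_experts, k)
        self.ema_decay = ema_decay
        # Per-expert running estimate of the average expert output
        # (an illustrative stand-in, not the paper's exact update rule).
        self.register_buffer("ema_expert_out", torch.zeros(n_experts, d_model))

    def forward(self, x):
        logits = self.router(x)
        probs = F.softmax(logits, dim=-1)
        _, topk_idx = probs.topk(self.k, dim=-1)
        selected = torch.zeros_like(probs, dtype=torch.bool)
        selected.scatter_(1, topk_idx, torch.ones_like(topk_idx, dtype=torch.bool))

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = selected[:, e]
            if mask.any():
                y = expert(x[mask])
                out[mask] = out[mask] + probs[mask, e].unsqueeze(-1) * y
                with torch.no_grad():              # refresh the cached estimate
                    self.ema_expert_out[e].lerp_(y.mean(0), 1 - self.ema_decay)
        # Dense term: gate probs of non-selected experts multiply the cached
        # (detached) estimates, sending gradient to every router logit.
        dense = (probs * (~selected)) @ self.ema_expert_out
        return out + dense

A quick smoke test of the shapes: `DenseGradTopKMoE(d_model=16, n_experts=8)(torch.randn(32, 16))` returns a `(32, 16)` tensor, and calling `.sum().backward()` on it populates `router.weight.grad` with nonzero entries for all experts, whereas the plain `TopKMoE` only receives gradient through the selected gates.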