BASE Layers: Simplifying Training of Large, Sparse Models

Mike Lewis; Shruti Bhosale; Tim Dettmers; Naman Goyal; Luke Zettlemoyer

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6265-6274, 2021.

Abstract

We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released.
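To make the linear assignment formulation concrete, here is a minimal sketch on a toy instance. It is an illustration under stated assumptions, not the released implementation: the function name `balanced_assignment`, the integer affinity scores, and the exhaustive search are all hypothetical choices made for clarity. Brute-force enumeration is only feasible for a handful of tokens; a practical system would need a dedicated assignment solver.

```python
from itertools import permutations

def balanced_assignment(scores):
    """Return (assignment, total) maximizing the summed token-expert score
    subject to every expert receiving the same number of tokens.

    scores[t][e] is a hypothetical affinity of token t for expert e.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    if num_tokens % num_experts != 0:
        raise ValueError("tokens must divide evenly among experts")
    capacity = num_tokens // num_experts

    # One slot per unit of expert capacity; every ordering of the slots
    # is a balanced token-to-expert assignment.
    slots = [e for e in range(num_experts) for _ in range(capacity)]

    best, best_total = None, float("-inf")
    for candidate in set(permutations(slots)):
        total = sum(scores[t][e] for t, e in enumerate(candidate))
        if total > best_total:
            best, best_total = list(candidate), total
    return best, best_total

# Toy instance: 4 tokens, 2 experts, integer scores chosen for illustration.
scores = [
    [9, 1],
    [8, 3],
    [7, 6],
    [2, 5],
]
assignment, total = balanced_assignment(scores)

# Greedy per-token routing, for comparison: picks each token's top expert.
greedy = [max(range(len(row)), key=row.__getitem__) for row in scores]
```

In this toy instance, greedy routing sends three of the four tokens to expert 0, while the balanced solution gives each expert exactly two tokens without any auxiliary balancing loss.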

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-lewis21a,
  title = 	 {BASE Layers: Simplifying Training of Large, Sparse Models},
  author =       {Lewis, Mike and Bhosale, Shruti and Dettmers, Tim and Goyal, Naman and Zettlemoyer, Luke},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {6265--6274},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/lewis21a/lewis21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/lewis21a.html},
  abstract = 	 {We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released.}
}

Endnote

%0 Conference Paper
%T BASE Layers: Simplifying Training of Large, Sparse Models
%A Mike Lewis
%A Shruti Bhosale
%A Tim Dettmers
%A Naman Goyal
%A Luke Zettlemoyer
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-lewis21a
%I PMLR
%P 6265--6274
%U https://proceedings.mlr.press/v139/lewis21a.html
%V 139
%X We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released.

APA


Lewis, M., Bhosale, S., Dettmers, T., Goyal, N. & Zettlemoyer, L. (2021). BASE Layers: Simplifying Training of Large, Sparse Models. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6265-6274. Available from https://proceedings.mlr.press/v139/lewis21a.html.
