Unified Scaling Laws for Routed Language Models

Aidan Clark; Diego De Las Casas; Aurelia Guy; Arthur Mensch; Michela Paganini; Jordan Hoffmann; Bogdan Damoc; Blake Hechtman; Trevor Cai; Sebastian Borgeaud; George Bm Van Den Driessche; Eliza Rutherford; Tom Hennigan; Matthew J Johnson; Albin Cassirer; Chris Jones; Elena Buchatskaya; David Budden; Laurent Sifre; Simon Osindero; Oriol Vinyals; Marc’Aurelio Ranzato; Jack Rae; Erich Elsen; Koray Kavukcuoglu; Karen Simonyan

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:4057-4086, 2022.

Abstract

The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-clark22a,
  title = 	 {Unified Scaling Laws for Routed Language Models},
  author =       {Clark, Aidan and De Las Casas, Diego and Guy, Aurelia and Mensch, Arthur and Paganini, Michela and Hoffmann, Jordan and Damoc, Bogdan and Hechtman, Blake and Cai, Trevor and Borgeaud, Sebastian and Van Den Driessche, George Bm and Rutherford, Eliza and Hennigan, Tom and Johnson, Matthew J and Cassirer, Albin and Jones, Chris and Buchatskaya, Elena and Budden, David and Sifre, Laurent and Osindero, Simon and Vinyals, Oriol and Ranzato, Marc'Aurelio and Rae, Jack and Elsen, Erich and Kavukcuoglu, Koray and Simonyan, Karen},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {4057--4086},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/clark22a/clark22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/clark22a.html},
  abstract = 	 {The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.}
}

Endnote

%0 Conference Paper
%T Unified Scaling Laws for Routed Language Models
%A Aidan Clark
%A Diego De Las Casas
%A Aurelia Guy
%A Arthur Mensch
%A Michela Paganini
%A Jordan Hoffmann
%A Bogdan Damoc
%A Blake Hechtman
%A Trevor Cai
%A Sebastian Borgeaud
%A George Bm Van Den Driessche
%A Eliza Rutherford
%A Tom Hennigan
%A Matthew J Johnson
%A Albin Cassirer
%A Chris Jones
%A Elena Buchatskaya
%A David Budden
%A Laurent Sifre
%A Simon Osindero
%A Oriol Vinyals
%A Marc’Aurelio Ranzato
%A Jack Rae
%A Erich Elsen
%A Koray Kavukcuoglu
%A Karen Simonyan
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-clark22a
%I PMLR
%P 4057--4086
%U https://proceedings.mlr.press/v162/clark22a.html
%V 162
%X The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.

APA


Clark, A., De Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S., Van Den Driessche, G.B., Rutherford, E., Hennigan, T., Johnson, M.J., Cassirer, A., Jones, C., Buchatskaya, E., Budden, D., Sifre, L., Osindero, S., Vinyals, O., Ranzato, M., Rae, J., Elsen, E., Kavukcuoglu, K. & Simonyan, K. (2022). Unified Scaling Laws for Routed Language Models. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:4057-4086. Available from https://proceedings.mlr.press/v162/clark22a.html.
