TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Zhuohan Li; Siyuan Zhuang; Shiyuan Guo; Danyang Zhuo; Hao Zhang; Dawn Song; Ion Stoica

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6543-6552, 2021.

Abstract

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe
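
The slicing idea in the abstract can be illustrated with a toy dynamic program. The sketch below is not the TeraPipe implementation. It assumes a latency model of the form "sum of slice times plus (K - 1) times the slowest slice" for a K-stage pipeline, and a made-up per-slice cost function; both are assumptions, as are all names and constants (`slice_cost`, `pipeline_latency`, `best_slicing`, `overhead`, `per_token`, `per_pair`).

```python
import math


def slice_cost(start, end, overhead=1.0, per_token=0.1, per_pair=0.001):
    """Hypothetical time for one stage to process tokens [start, end).

    Assumption: a fixed launch overhead, a per-token term, and an attention
    term in which the token at position p attends to p + 1 tokens.
    """
    length = end - start
    attended = (start + 1 + end) * length // 2  # sum of (p + 1) over the slice
    return overhead + per_token * length + per_pair * attended


def pipeline_latency(cuts, num_stages, cost=slice_cost):
    """Assumed latency model: sum of slice times + (K - 1) * slowest slice."""
    times = [cost(a, b) for a, b in zip(cuts, cuts[1:])]
    return sum(times) + (num_stages - 1) * max(times)


def best_slicing(seq_len, num_stages, cost=slice_cost):
    """Return (latency, cuts) minimizing the assumed latency model.

    For each candidate bound on the slowest slice, a DP over prefixes finds
    the smallest total time using only slices within the bound.
    """
    table = {(a, b): cost(a, b)
             for a in range(seq_len) for b in range(a + 1, seq_len + 1)}
    best_latency, best_cuts = math.inf, None
    for bound in sorted(set(table.values())):
        total = [0.0] + [math.inf] * seq_len  # total[i]: best time for [0, i)
        prev = [None] * (seq_len + 1)
        for end in range(1, seq_len + 1):
            for start in range(end):
                t = table[(start, end)]
                if t <= bound and total[start] + t < total[end]:
                    total[end] = total[start] + t
                    prev[end] = start
        if math.isinf(total[seq_len]):
            continue  # no slicing fits under this bound
        latency = total[seq_len] + (num_stages - 1) * bound
        if latency < best_latency:
            cuts = [seq_len]
            while cuts[-1] != 0:
                cuts.append(prev[cuts[-1]])
            best_latency, best_cuts = latency, cuts[::-1]
    return best_latency, best_cuts


if __name__ == "__main__":
    latency, cuts = best_slicing(seq_len=32, num_stages=4)
    print("slice boundaries:", cuts)
    print("modelled latency:", latency)
    print("unsliced latency:", pipeline_latency([0, 32], 4))
```

Under this toy model, later slices cost more per token because they attend to a longer prefix, so the search is free to pick non-uniform slice lengths. The real system would measure slice times on the target cluster rather than use a closed-form cost.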

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-li21y,
  title     = {TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models},
  author    = {Li, Zhuohan and Zhuang, Siyuan and Guo, Shiyuan and Zhuo, Danyang and Zhang, Hao and Song, Dawn and Stoica, Ion},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {6543--6552},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/li21y/li21y.pdf},
  url       = {https://proceedings.mlr.press/v139/li21y.html},
  abstract  = {Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe}
}

Endnote

%0 Conference Paper
%T TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
%A Zhuohan Li
%A Siyuan Zhuang
%A Shiyuan Guo
%A Danyang Zhuo
%A Hao Zhang
%A Dawn Song
%A Ion Stoica
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-li21y
%I PMLR
%P 6543--6552
%U https://proceedings.mlr.press/v139/li21y.html
%V 139
%X Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe

APA

Li, Z., Zhuang, S., Guo, S., Zhuo, D., Zhang, H., Song, D. & Stoica, I. (2021). TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6543-6552. Available from https://proceedings.mlr.press/v139/li21y.html.
