TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6543-6552, 2021.

Abstract

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-li21y, title = {TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models}, author = {Li, Zhuohan and Zhuang, Siyuan and Guo, Shiyuan and Zhuo, Danyang and Zhang, Hao and Song, Dawn and Stoica, Ion}, booktitle = {Proceedings of the 38th International Conference on Machine Learning}, pages = {6543--6552}, year = {2021}, editor = {Meila, Marina and Zhang, Tong}, volume = {139}, series = {Proceedings of Machine Learning Research}, month = {18--24 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v139/li21y/li21y.pdf}, url = {http://proceedings.mlr.press/v139/li21y.html}, abstract = {Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe} }
Endnote
%0 Conference Paper %T TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models %A Zhuohan Li %A Siyuan Zhuang %A Shiyuan Guo %A Danyang Zhuo %A Hao Zhang %A Dawn Song %A Ion Stoica %B Proceedings of the 38th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2021 %E Marina Meila %E Tong Zhang %F pmlr-v139-li21y %I PMLR %P 6543--6552 %U http://proceedings.mlr.press/v139/li21y.html %V 139 %X Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe
APA
Li, Z., Zhuang, S., Guo, S., Zhuo, D., Zhang, H., Song, D. & Stoica, I.. (2021). TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6543-6552 Available from http://proceedings.mlr.press/v139/li21y.html.

Related Material