Abstract

Pipeline parallelism is a key technique for training large language models within GPU clusters. However, it often leads to a memory imbalance problem, where certain GPUs face high memory pressure while others underutilize their capacity. This imbalance results in suboptimal training performance, even when the overall GPU memory capacity is sufficient for more efficient setups. To address this inefficiency, we propose **BP IPE**, a novel approach for achieving memory balance in pipeline parallelism. **BP IPE** employs an activation balancing method to transfer intermediate activations between GPUs during training, enabling all GPUs to utilize comparable amounts of memory. With balanced memory utilization, **BP IPE** enhances the training efficiency of large language models like GPT-3 by eliminating redundant recomputations or increasing the microbatch size. Our evaluation conducted on 48 A100 GPUs across six nodes interconnected with HDR InfiniBand shows that **BP IPE** accelerates the training of GPT-3 96B and GPT-3 134B models by 1.25x-2.17x compared to Megatron-LM, a state-of-the-art framework for training large language models.

1. Introduction

After the advent of the Transformer architecture (Vaswani et al., 2017), there has been a dramatic increase in the size of language models (Brown et al., 2020; Smith et al., 2022; Zhang et al., 2022; Chowdhery et al., 2022; Ouyang et al., 2022; Thoppilan et al., 2022; OpenAI, 2023). These models show astonishing results in a wide range of applications by exploiting more than a hundred billion parameters. Such an overwhelming number of parameters incurs high memory pressure, making large language model (LLM) training challenging. When training a model in mixed precision (Micikevicius et al., 2017) with the Adam (Kingma & Ba, 2014) optimizer, we need 20 bytes of memory for each model parameter (Smith et al., 2022). Hence, training a GPT-3 175B model needs more than 3,000 GiB to store the model parameters and optimizer states. Yet, no GPU exists whose memory capacity satisfies the requirement.

A few methods, such as model parallelism and activation recomputation (Griewank & Walther, 2000; Kirisame et al., 2021), alleviate the memory pressure to satisfy the requirement. Model parallelism partitions the model parameters and optimizer states across multiple GPUs so that each GPU stores a subset of the model parameters. It is further classified into tensor parallelism (Shazeer et al., 2018; Shoeby et al., 2019) and pipeline parallelism (Huang et al., 2019), where tensor parallelism splits the operations across GPUs and pipeline parallelism splits the layers across GPUs. On the other hand, activation recomputation releases intermediate activations from memory right after forward compu-
tation and recomputes them during backward computation. Since model parallelism and activation recomputation add communication or computation overhead, how we configure them affects training performance significantly. Therefore, finding the configuration that achieves maximum performance and then scaling up training with data parallelism is essential for efficient LLM training.

However, due to its nature, pipeline parallelism could hinder finding the optimal configuration. Unlike tensor parallelism, pipeline parallelism assigns each GPU to handle a separate pipeline stage that computes different layers in a model. Accordingly, each pipeline stage has data dependency on others and results in computation stalls until the required data have arrived, commonly known as a pipeline bubble. To minimize the bubble, a 1F1B (one-forward and one-backward) pipeline schedule (Narayanan et al., 2019; Fan et al., 2021) splits an input batch into micro-batches and processes forward computation and backward computation alternately. In order to saturate all pipeline stages, earlier stages should reserve more memory for computing more forward micro-batches than later stages. Consequently, a memory imbalance exists across the pipeline stages, and executing the model fails if the earlier stages run out of memory, as illustrated in Figure 1.

Our key observation is that the later stages cannot utilize the same amount of GPU memory as the earlier stages require to precompute forward micro-batches. Therefore, if we can exploit the spare memory of later stages as extra memory of the earlier stages, the memory pressure will be relieved with a balanced memory load. In addition, reduced memory pressure allows us to utilize more memory to accelerate training by avoiding redundant recomputations, increasing the micro-batch size, or decreasing the model parallelism degree.

To this end, we propose BPIPE, a memory-balanced pipeline parallelism approach with an activation balancing method to resolve the memory imbalance. While training, BPIPE flattens the memory usage of pipeline stages by transferring activations between earlier and later stages. We propose a transfer scheduling algorithm to minimize the number of transfers while preserving the computation correctness. Furthermore, we design a pair-adjacent stage assignment to make transfers not affect the training time. As a result, BPIPE achieves a balanced GPU memory usage and facilitates efficient LLM training.

We have implemented BPIPE on Megatron-LM (Korthikanti et al., 2022). Our evaluation on six HPE Apollo 6500 8-GPU A100 nodes with 800 Gbps cross-node bandwidth shows that BPIPE can accelerate training GPT-3 96B and GPT-3 134B models by 1.25x-2.17x by executing more efficient training configurations with fewer recomputations and larger micro-batch sizes.

Figure 2. An illustration of a 4-way 1F1B pipeline schedule with eight micro-batches. The memory pressure of each pipeline stage increases during the warmup phase, stays constant at the steady phase, and decreases within the cooldown phase. After the cooldown phase, parameters are updated with accumulated gradients of each micro-batch. A number in either forward or backward denotes the micro-batch index.

2. Background & Motivation

When training LLMs with limited GPU resources, model parallelism is necessary to split the model into multiple partitions and make each part fit into a single GPU. Depending on how the model is divided, we can classify model parallelism into two categories: tensor parallelism and pipeline parallelism.

Tensor parallelism partitions an operation so that each GPU executes the same operation with partial inputs. An additional synchronize operation, usually collective communication such as all-reduce and all-gather, follows the partitioned operation. Owing to its costly synchronization, tensor parallelism is known to be practical when used within a machine boundary (Narayanan et al., 2021b).

On the contrary, pipeline parallelism partitions the layers of a model into multiple stages and distributes them across the GPUs. While training, two consecutive pipeline stages exchange intermediate activations or gradients with point-to-point communication. Since the point-to-point communication of pipeline parallelism adds small overheads compared to the synchronization of tensor parallelism, the pipeline parallelism degree can increase along with the model size.

However, increasing the degree of pipeline parallelism yields a memory imbalance problem between pipeline stages. Using the 1F1B pipeline schedule, as illustrated in Figure 2, the first stage has to store activations as many micro-batches as the pipeline parallelism degree during the warmup phase. For the other stages, the number of micro-batches in the warmup phase decreases linearly, so the last stage holds activations of a single micro-batch. Consequently, the imbalance of memory usage exists across the pipeline stages, and its magnitude amplifies with an escalation in the pipeline parallelism degree.

Figure 3 shows the memory usage of each pipeline stage when training a GPT-3 13B (Brown et al., 2020) model.
with 8-way pipeline parallelism using 8 NVIDIA A100 80 GiB GPUs. In 8-way pipeline parallelism, the first stage computes eight micro-batches during the warmup phase. Therefore, the difference in the number of micro-batches between stage 0 and stage 7 is seven, causing a 37 GiB memory imbalance. In other words, stage 7 only utilizes up to half of the memory due to the imbalance. Even worse, attempting to accelerate training by utilizing the unused memory of the last pipeline stage (e.g., increasing the micro-batch size) might fail because the first stage no longer has enough memory.

Asymmetric partitioning of the pipeline stage (i.e., later stages contain more layers than earlier stages) might eliminate the imbalance, but it incurs inefficiency because the pipeline latency is minimized when each pipeline stage has the same computation time (Narayanan et al., 2019; Fan et al., 2021; Li et al., 2021b; Zheng et al., 2022). Alternatively, Chimera (Li & Hoefler, 2021) relieves the imbalance by replicating model parameters so that each GPU holds two pipeline stages. However, since replicating model parameters doubles the memory usage, the overall memory usage increases for the same pipeline degree. To the best of our knowledge, BPipe is the first work that speeds up training LLMs by addressing the memory imbalance with smart activation transfer across GPUs.

3. Method

In this section, we describe an activation balancing method that solves the memory imbalance problem by transferring activations between pipeline stages. First, we formalize the memory imbalance and provide a high-level view of the activation balancing. Then, we define the balanced memory objective and demonstrate an activation transfer scheduling algorithm that achieves the objective. Finally, we describe how we assign pipeline stages to GPUs to minimize the transfer time.

3.1. Pipeline Memory Imbalance

On $p$-way pipeline parallelism with $m$ micro-batches, the memory usage of the pipeline stage $s$ can be expressed as $M(s) = W(s) + A(s)$, where $A(s)$ is the maximum amount of saved micro-batches and $W(s)$ is the size of model parameters including optimizer states. To simplify, assume that a model has repetitive layers which account for almost all of the model parameters, such as Transformer models (Devlin et al., 2018; Raffel et al., 2020; Brown et al., 2020). If all pipeline stages have an even number of layers, $W(s) = W_0$ is constant. Then, $A(s)$ that varies with the pipeline stage $s$ determines the memory usage. If we define $\mu(s)$ as the maximum number of saved micro-batches on stage $s$, $A(s)$ becomes $A(s) = A_0 \mu(s)$, where $A_0$ is the size of saved activations per micro-batch, which is also a constant value when each pipeline stage has the same number of layers. As a result, $M(s)$ can be written as the following.

$$M(s) = W_0 + A_0 \mu(s)$$

(1)

According to Figure 2, each pipeline stage in the 1F1B pipeline schedule possesses at most $\mu(s) = \min(p - s, m)$. In the practical case, $m \gg p$ is satisfied because such a case fully saturates all pipeline stages. Therefore, $\mu(s)$ of the 1F1B pipeline schedule satisfies the following relation.

$$\mu(s) = p - s$$

(2)

Substituting $\mu(s)$ in Eq. 1 with $p - s$, Eq. 1 denotes that the difference in the number of saved micro-batches is a cause of the imbalance of $M(s)$.

3.2. Activation Balancing

BPipe eliminates the memory imbalance by reducing the difference in the number of saved micro-batches for all pipeline stages. Eq. 1 and Eq. 2 show that the memory usage of the pipeline stage decreases linearly as the pipeline stage increases. Accordingly, we pair each stage $s$ with stage $p - s - 1$ to balance $\mu(s)$ and $\mu(p - s - 1)$. The memory imbalance within each pair can be written as the following equation.

$$M(s) - M(p - s - 1) = A_0(p - 2s - 1)$$

(3)

Eq. 3 implies that the pipeline stages with $s < \frac{p - 1}{2}$ occupy more memory than the stages with $s > \frac{p - 1}{2}$. We define the former stages as evictors and the latter stages as acceptors. Each pair then consists of an evictor and an acceptor, where the evictor evicts activations to its pair acceptor to balance $\mu(s)$ and $\mu(p - s - 1)$.
An objective of the activation balancing is accomplishing $\mu(s) \leq \mu_{opt}$ for all pipeline stages throughout the training. An evictor with pipeline stage $s$ achieves the objective by evicting at most $(p - s) - \mu_{opt} + 1$ micro-batches to its pair acceptor. Simultaneously, the pair acceptor saves at most $s + 1$ micro-batches following the pipeline schedule and $(p - s) - \mu_{opt} + 1$ micro-batches from the evictor. $\mu(p - s - 1)$ then becomes $p + 2 - \mu_{opt}$, where the value is less or equal to $\mu_{opt}$. Therefore, both the evictor and acceptor achieve the objective.

The memory objective implies that BPIPE does not trigger any transfer of activations for the evictors whose stages already satisfy the objective. In other words, the activation balancing operates when $p \geq 4$, and only the evictors whose pipeline stage satisfies $s \leq \lfloor \frac{p - 1}{2} \rfloor$. For example, the inner-most pair of pipeline stages, which is stage 1 and stage 2 in Figure 4, does not transfer activations because they already satisfy the objective.

### 3.4. Transfer Schedule

To achieve the memory objective, we propose a transfer scheduling algorithm. The algorithm gets a pipeline parallelism degree $p$, a pipeline stage $s$ of an evictor, the number of micro-batches $m$, and a computation schedule $C$ of 1F1B pipeline as inputs and returns a transfer schedule $T$ for stage $s$. A computation schedule is an array of computation decisions. Each computation decision $C_i$ comprises the type of computation made and the index of the micro-batch being processed. For example, $C$ of pipeline stage 0 in Figure 4 has 22 computation decisions, including pipeline bubbles. $C_0$ has 0 as a micro-batch index with a FORWARD computation type, $C_9$ has 1 as a micro-batch index with a BACKWARD computation type, and $C_{18}$ has an empty micro-batch index because the computation type of $C_{18}$ is a BUBBLE.

Algorithm 1 describes the details of scheduling for the given inputs. The algorithm traverses the computation schedule, deciding when to evict or load. Within the warmup phase, $n_{evict}$ micro-batches are evicted to make $\mu(s)$ equal to $\mu_{opt}$ (lines 9-13). When the algorithm encounters a BACKWARD type computation decision at $i$ and finds that the micro-batch required to compute the backward (i.e., $C_i.idx$) has been evicted, it loads the micro-batch at $i - 1$ (lines 14-18). However, this load would make $\mu(s)$ exceed $\mu_{opt}$ if $C_i$ belongs to the steady phase. In other words, when $C_{i - 1}$ is a FORWARD type, both $C_{i - 1}$ and the load at $i - 1$ increase $\mu(s)$ so that $\mu(s)$ exceeds $\mu_{opt}$. The algorithm prevents it by evicting an additional micro-batch that will be needed at the furthest future at $i - 2$, whose index is $C_{i - 3}.idx$ (lines 19-23). On the other hand, if $C_i$ belongs to the cooldown phase, $C_{i - 1}$ is the BUBBLE type which

\[ W_0 + A_0 \mu_{opt}. \]

### 3.3. Balanced Memory Objective

From Eq. 2, the sum of the maximum number of saved micro-batches for any evictor-acceptor pair is $\mu(s) + \mu(p - s - 1) = p + 1$. Including a buffer for the additional evictions at the steady phase, the value becomes $p + 2$. Now, the optimally balanced number of micro-batches $\mu_{opt}$ is derived as $\mu_{opt} = \lceil \frac{p + 2}{2} \rceil$, and the optimal memory balance $M_{opt}$ is
Algorithm 1 Transfer Scheduling Algorithm

1:   Input: $p, s, m$, 1F1B computation schedule $C$
2:   Output: transfer schedule $T$
3:   \[ \mu_{opt} \leftarrow \left[ \frac{m}{p} \right] \] 
4:   \[ n_{evict} \leftarrow \min(p - s, m) - \mu_{opt} \]
5:   \[ evicted \leftarrow \emptyset \]
6:   \[ T_i \leftarrow \text{None} \forall 0 \leq i < |C| \]
7:   for $i = 0$ to $|C| - 1$
8:   if $C_i$.type = FORWARD then
9:     if $\mu_{opt} - 1 \leq C_i$.idx < $\mu_{opt} - 1 + n_{evict}$ then
10:        /* warmup phase */
11:        $T_i \leftarrow \text{Evict}(C_i$.idx) 
12:        evicted $\leftarrow$ evicted $\cup \{C_i$.idx\}
13:   end if
14:   else if $C_i$.type = BACKWARD then
15:        /* steady or cooldown phase */
16:        if $C_i$.idx $\in$ evicted then
17:           $T_{i-1} \leftarrow \text{Load}(C_i$.idx)
18:           evicted $\leftarrow$ evicted $\setminus \{C_i$.idx\}
19:        if $C_{i-1}$.type = FORWARD then
20:           /* steady phase */
21:           /* evict to keep $\mu(s) \leq \mu_{opt}$ */
22:           $T_{i-2} \leftarrow \text{Evict}(C_{i-3}$.idx)
23:           evicted $\leftarrow$ evicted $\cup \{C_{i-3}$.idx\}
24:        end if
25:    end if
26:  end if
27: end for

does not increase $\mu(s)$. Thus, $\mu(s)$ does not exceed $\mu_{opt}$, and the algorithm does not evict an additional micro-batch.

The generated transfer schedule $T$ contains an array of transfer decisions that interleaved with the 1F1B computation schedule. While processing the 1F1B schedule, BPIPE simultaneously processes the transfer schedule. BPIPE evicts the saved activations of the $j$-th micro-batch if $T_i$ is Evict($j$) and loads them if $T_i$ is Load($j$). When $T_i$ is None, a default decision in line 6, BPIPE does not process any transfer.

Following the computations with the corresponding transfer decisions, BPIPE achieves the memory objective for all pipeline stages with the minimum number of transfers. The number of transfers is the summation of the number of Evicts and Loads. We only consider the number of Evicts because each eviction requires exactly one corresponding Load before processing the backward computation. Subsequently, we can interpret the micro-batch eviction as analogous to cache eviction. In other words, an evictor manages a cache storage whose capacity is $\mu_{opt}$. It evicts an item (i.e., forward micro-batch) to the memory space of its pair acceptor when the cache is full. Deciding which item to evict is then determined based on a cache eviction policy. Unlike common caching scenarios, we know exactly when each item is required in the future. Therefore, as evicting the item that will be needed in the furthest future is optimal (Belady, 1966), our scheduling algorithm minimizes the number of transfers.

Although the algorithm assumes a 1F1B computation schedule as an input, we can extend it to other variants of 1F1B (Narayanan et al., 2021a; Yang et al., 2021; Athlur et al., 2022; Zhuang et al., 2022) because they all have a memory imbalance. Appendix A describes how we can extend the algorithm to the interleaved 1F1B pipeline schedule (Narayanan et al., 2021b).

### 3.5. Pair-Adjacent Assignment

As BPIPE processes a transfer schedule simultaneously with a 1F1B computation schedule, each transfer should take less time than FORWARD or BACKWARD not to affect the training time. To minimize the transfer time, we propose a pair-adjacent assignment that locates each evictor-acceptor pair in the same node in a cluster. Within the same node,
**4. Evaluation**

To evaluate BPIPE, we ask the following questions.

- Does BPIPE facilitate faster training of large language models? (§ 4.2)
- Does BPIPE flatten the memory usage of each pipeline stage? (§ 4.3)
- Does BPIPE efficiently evict and load activations without performance degradation? (§ 4.4)

**4.1. Implementation and Environment Setup**

We have implemented BPIPE on Megatron-LM v3 (Korthikanti et al., 2022). We utilize separate CUDA streams for evicting and loading activations so that activations can be transferred concurrently with either forward or backward computation. In addition, we manually manage CUDA memory for activations that are needed for backward computation and reuse pre-allocated memory after the eviction to avoid unexpected memory fragmentation.

Our evaluations are conducted on a cluster of six HPE Apollo 6500 Gen10 Plus nodes, each of which is equipped with 8 NVIDIA 80 GiB A100 GPUs connected over NVLink and 4 Mellanox 200 Gbps HDR InfiniBand HCAs for communication. All experiments are executed on the NVIDIA PyTorch NGC 22.09 container. We evaluate GPT-3 (Brown et al., 2020) throughout the experiments, one of the most representative LLMs. We use three different model configurations, as shown in Table 1, in which the largest model has 134 billion parameters in total. Sequence length and vocabulary size are 2,048 and 51,200 for all models, respectively, and we use mixed precision training (Micikevicius et al., 2017). Although we evaluate only GPT-3 models, BPIPE is applicable to any model as long as pipeline parallelism is used to train the model.

**4.2. Training Performance**

To evaluate whether BPIPE can accelerate training, we find the fastest configuration for training GPT-3 96B and GPT-3 134B models, with and without BPIPE. Our baseline is

---

**Table 1. Model configurations for evaluation.**

<table>
<thead>
<tr>
<th>Model</th>
<th>L</th>
<th>D</th>
<th>H</th>
<th>G</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPT-3 13B</td>
<td>40</td>
<td>5120</td>
<td>40</td>
<td>8</td>
<td>32</td>
</tr>
<tr>
<td>GPT-3 96B</td>
<td>80</td>
<td>9984</td>
<td>104</td>
<td>32</td>
<td>128</td>
</tr>
<tr>
<td>GPT-3 134B</td>
<td>84</td>
<td>11520</td>
<td>120</td>
<td>48</td>
<td>192</td>
</tr>
</tbody>
</table>

---

**Figure 6.** An illustration of how the pair-adjacent assignment assigns pipeline stages to GPUs when 4-way pipeline parallelism with 2-way tensor parallelism and 2-way data parallelism on two nodes, each with 8 GPUs. ‘TP’ and ‘DP’ denote the rank of tensor parallelism and data parallelism. ‘PP’ represents the pipeline stage.
Table 2. Training configurations of GPT-3 96B and GPT-3 134B models. ‘tensor’ and ‘pipeline’ represent the tensor and pipeline parallelism degrees, respectively. The remaining GPUs are used for data parallelism, and we use ZeRO stage-1 data parallelism (Rajbhandari et al., 2020) that splits optimizer states. Moreover, tensor parallelism includes partitioning layer normalization and dropout, as introduced in Korthikanti et al. 2022. ‘mb’ denotes the micro-batch size, and each value corresponds to a different training configuration.

<table>
<thead>
<tr>
<th>Model</th>
<th>Model Parallelism ID</th>
<th>tensor</th>
<th>pipeline</th>
<th>mb</th>
<th>Recompute scope</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPT-3 96B</td>
<td>(1) 1 16</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(2) 2 8</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(3) 4 4</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(4) 8 2</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(5) 2 16</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(6) 4 8</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(7) 8 4</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>GPT-3 134B</td>
<td>(1) 2 12</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(2) 4 6</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(3) 8 3</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(4) 4 12</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>(5) 8 6</td>
<td>1 layer</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Megatron-LM v3 (Korthikanti et al., 2022), a state-of-the-art LLM training framework. We perform a grid search to find the best configuration as follows. In addition to the notations in Table 1, we define tensor parallelism degree as \( t \), pipeline parallelism degree as \( p \), data parallelism degree as \( d \), and micro-batch size as \( mb \). Then, the following constraints exist.

- \( H \mod t = 0 \): The number of attention heads should be divisible by the tensor parallelism degree because tensor parallelism for Transformer layers splits attention heads.
- \( 8 \mod t = 0 \) (8 is the number of GPUs in a single node): Tensor parallelism is practical when used within a node boundary due to its costly synchronization.
- \( L \mod p = 0 \): The number of layers should be divisible by the pipeline parallelism degree, assuming that pipeline parallelism evenly divides the layers.
- \( B \mod (mb \times d) = 0 \): The total batch size should be divisible by the micro-batch size times data parallelism degree. The number of micro-batches is \( B/(mb \times d) \).
- \( t \times d \times p = G \): Multiplication of tensor, data, and pipeline parallelism degrees should be equal to the total number of GPUs.

Under the constraints, we enumerate all possible tuples of \((t, p, d, mb)\). We evaluate them with Megatron-LM for each recomputation scope in the order of none, attention, and layer, where the none scope does not recompute any activation, the attention scope recomputes only the self-attention of the Transformer layer, which is known as the selective recomputation (Korthikanti et al., 2022), and the layer scope recomputes the entire Transformer layer. We apply early stopping when succeeding to execute with fewer recomputations because carrying out more recomputations for the same \((t, p, d, mb)\) is inefficient. Then, each \((t, p, d, mb)\) with a recomputation scope composes a single training configuration. For BIPE, we evaluate the configurations that Megatron-LM cannot execute since the activation balancing does not accelerate training. If BIPE fails due to the out-of-memory error, we exclude those configurations. As a result, Table 2 lists all feasible training configurations.

We use model FLOPS utilization (MFU) as an evaluation metric, a ratio of the observed throughput to the hardware maximum throughput (Korthikanti et al., 2022). Figure 7 shows that BIPE can execute the configuration that achieves 52.06% MFU, though Megatron-LM cannot execute it due to the memory imbalance problem. Consequently, BIPE accelerates training the models by 1.25x compared to the fastest training configuration that Megatron-LM can execute and 2.17x than the most inefficient configuration of Megatron-LM. The raw MFU numbers of Figure 7 are listed in Appendix C.

4.3. Memory Balancing

Figure 8 presents the change in memory usage when using BIPE. Without the activation balancing, the maximum memory usage difference between the pipeline stages is larger than 25 GiB. The model cannot be executed because the first stage runs out of memory. However, the difference sharply reduces to 10 GiB with BIPE because BIPE flattens the memory usage of each evictor-acceptor pair. As a result, BIPE facilitates faster training with more efficient resource utilization, as Figure 7 shows.

If the model size grows, the pipeline parallelism degree should increase because increasing the tensor parallelism degree is bounded up to the number of GPUs in a single node. Then, the memory imbalance would also increase following the pipeline parallelism degree. For example, Narayanan et al. (2021b) and Korthikanti et al. (2022) use...
BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models

Figure 7. Normalized training throughputs of GPT-3 96B and GPT-3 134B models. Labels of the bars represent the training configurations of Table 2. For example, a bar whose label is (1)-mb1 denotes training configuration with model parallelism ID (1) and the micro-batch size as 1. ‘X’ denotes that Megatron-LM fails to execute due to the out-of-memory (OOM). On the other hand, BPipe makes executing all of them possible with the activation balancing. Moreover, they are faster than all the configurations that Megatron-LM can execute.

Figure 8. Memory usage of each pipeline stage with the activation balancing compared to without the activation balancing when training a GPT-3 134B model with configuration (4)-mb2 of Table 2. To estimate the memory usage without the activation balancing, we allow the activation balancing only for stage 0 and stage 11 since stage 0 has insufficient memory to execute.

4.4. Performance Analysis

We inspect the efficiency of BPipe by comparing the iteration time before and after applying the activation balancing. Figure 9 shows the additional time cost of the activation balancing, varying the batch size from 32 to 1,024 when training a GPT-3 13B model with 8-way pipeline parallelism. The cost occupies 1.1% even when the number of micro-batches is sufficiently large.

BPipe achieves a low time cost by virtue of the asynchronous activation transfer. If we transfer activations synchronously, the time overhead shoots up to 11%, as Figure 9 shows. On the contrary, asynchronous transfer overlaps with the computation because the computation time takes far longer than the transfer time. The transfer overlaps even if the model size grows because the computation time increases more steeply than the transfer time, as Table 3 shows. Furthermore, the size of activations to transfer increases with fewer recomputations. However, the transfer time does not exceed the forward time even for the no recomputation, as Table 4 shows.

5. Related Work

Data parallelism. Data parallelism replicates model parameters and optimizer states across GPUs (Li et al., 2020). An input batch is divided into multiple mini-batches for each training step, so each GPU conducts forward-backward computation with a different mini-batch. After each GPU finishes a single training step, all GPUs carry out an all-reduce collective communication to synchronize gradients and then update the model parameters. Thereby, all GPUs always have the same parameters throughout the training. Since collective communication occurs only once in a single
Table 3. Time breakdown of transfer, forward, and backward. All times are the elapsed time for processing a single micro-batch. For GPT-3 13B, 8-way pipeline parallelism is used with the attention recomputation scope and the micro-batch size is 1. For GPT-3 96B and GPT-3 134B, we use configuration (6)-mb2 and (4)-mb2 of Table 2, respectively.

<table>
<thead>
<tr>
<th>Model</th>
<th>transfer</th>
<th>forward</th>
<th>backward</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPT-3 13B</td>
<td>7.63 ms</td>
<td>36.32 ms</td>
<td>83.57 ms</td>
</tr>
<tr>
<td>GPT-3 96B</td>
<td>15.63 ms</td>
<td>143.37 ms</td>
<td>287.90 ms</td>
</tr>
<tr>
<td>GPT-3 134B</td>
<td>12.06 ms</td>
<td>126.87 ms</td>
<td>256.30 ms</td>
</tr>
</tbody>
</table>

Table 4. Time breakdown by varying the recomputation scope of a GPT-3 13B model with 8-way pipeline parallelism.

<table>
<thead>
<tr>
<th>Recompute scope</th>
<th>transfer</th>
<th>forward</th>
<th>backward</th>
</tr>
</thead>
<tbody>
<tr>
<td>none</td>
<td>22.48 ms</td>
<td>36.32 ms</td>
<td>74.12 ms</td>
</tr>
<tr>
<td>attention</td>
<td>7.63 ms</td>
<td>36.32 ms</td>
<td>83.57 ms</td>
</tr>
<tr>
<td>layer</td>
<td>0.58 ms</td>
<td>36.32 ms</td>
<td>110.33 ms</td>
</tr>
</tbody>
</table>

Training step, data parallelism is the best way to scale up training. Beyond replicating model parameters, ZeRO (Rajbhandari et al., 2020) and Fully Sharded Data Parallelism (FSDP) (Zhao et al., 2023) reduce the memory pressure by splitting optimizer states. They propose more aggressive splits, including gradients and model parameters, with more frequent collective communication.

Tensor parallelism. Mesh-TensorFlow (Shazeer et al., 2018) introduces an abstraction that can express arbitrary operation partitioning. It also presents that data parallelism is one of the cases of tensor parallelism that splits tensors across the batch dimension. Moreover, some model-specific tensor parallel strategies exist that efficiently reduce memory pressure with moderate synchronization costs. For example, Megatron-LM (Shoeybi et al., 2019) proposes an efficient partitioning strategy of two consecutive matrix multiplications in the Transformer layer. Additionally, partitioning along the sequence dimension of the Transformer layer is also feasible (Li et al., 2021a; Korthikanti et al., 2022).

Pipeline parallelism. To further minimize the pipeline bubble, several pipeline schedules (Narayanan et al., 2021a;b; Li & Hoeffler, 2021; Yang et al., 2021; Athlur et al., 2022; Zhuang et al., 2022) have been proposed. Such schedules stem from the 1F1B schedule but sacrifice memory usage or even affect model correctness. Furthermore, token-level scheduling (Li et al., 2021b) is proposed for an auto-regressive language model (Brown et al., 2020). However, it is effective when the batch size is much smaller than the pipeline parallelism degree, which is unlikely in practical LLM training scenarios.

Activation recomputation. Since recomputation incurs an overhead equal to the forward computation time, activations to discard and reserve should be well-decided. Chen et al. (2016) proposed an algorithm that can reduce memory consumption to sub-linear costs for a linear computation chain. Checkmate (Jain et al., 2020) solves a mixed integer linear program to find an optimal recomputation strategy for an arbitrary model structure. When models have repetitive layers, recomputing activations layer by layer is widely adopted in practice. In addition to the layerwise strategy, recent research (Korthikanti et al., 2022) has shown that recomputing only the attention of the Transformer layer can efficiently reduce the memory pressure with small computation overhead.

CPU offloading. In general, a CPU has orders of magnitude larger, cheaper, and slower memory than a GPU. Accordingly, several approaches have been proposed that exploit CPU memory as a swap storage of limited GPU memory (Rhu et al., 2016; Wang et al., 2018; Pudipeddi et al., 2020; Peng et al., 2020; Huang et al., 2020; Ren et al., 2021; Rajbhandari et al., 2021). However, utilizing CPU memory is not scalable because communication between CPU and GPU is slow due to the limited communication bandwidth of PCIe.

Automatic search for the optimal training configuration. Given a deep learning model and a cluster environment, various systems exist that automatically search for the optimal execution plan (Wang et al., 2019; Jia et al., 2019; Lepikhin et al., 2020; Rasley et al., 2020; Tarnawski et al., 2021; Xu et al., 2021; Bian et al., 2021; Karakus et al., 2021; Jia et al., 2022; Zheng et al., 2022; Unger et al., 2022). Such systems configure a search space with a cost model and explore it to find the best result. As BPIPE rewrites the memory usage of pipeline parallelism, BPIPE can expand the search space and help find better configurations.

6. Conclusion
We propose BPIPE, a memory-balanced pipeline parallelism method for training LLMs. With the activation balancing that transfers intermediate activations between pipeline stages, BPIPE resolves the memory imbalance problem of pipeline parallelism. While training, all pipeline stages utilize a comparable amount of memory by storing a balanced amount of activations in the unit of micro-batch. Our evaluation shows that BPIPE speeds up training large-scale GPT-3 models by executing faster training configurations that are not feasible without BPIPE.
BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models

References


A. Transfer Scheduling for Interleaved 1F1B Pipeline Schedule

\[ \mu_{\text{opt}} = \left\lceil \frac{p(v - 1) + 2(p - s - 1) + 1}{2} \right\rceil = \frac{p(v - 1) + 2(p - s - 1) + 1 + 1}{2} \]

The transfer scheduling of the interleaved 1F1B pipeline schedule is similar to that of the vanilla 1F1B schedule. Figure 10 illustrates how the activation balancing operates with a 4-way interleaved 1F1B pipeline schedule with two model chunks. \( \mu_{\text{opt}} \) becomes 9, so stage 0 evicts two micro-batches to stage 3. In the steady or cooldown phase, it loads the evicted micro-batches to perform the backward computations.

B. Network Bandwidth Requirement for Activation Balancing

<table>
<thead>
<tr>
<th>Variable</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>( L )</td>
<td>number of layers</td>
</tr>
<tr>
<td>( H )</td>
<td>number of attention heads</td>
</tr>
<tr>
<td>( D )</td>
<td>hidden dimension size</td>
</tr>
<tr>
<td>( l )</td>
<td>sequence length</td>
</tr>
<tr>
<td>( p )</td>
<td>pipeline parallelism degree</td>
</tr>
<tr>
<td>( t )</td>
<td>tensor parallelism degree</td>
</tr>
<tr>
<td>( b )</td>
<td>micro-batch size</td>
</tr>
<tr>
<td>( f )</td>
<td>forward computation time of a single micro-batch [sec]</td>
</tr>
</tbody>
</table>

Since each transfer of the activation balancing should finish earlier than the forward computation of a single micro-batch, we can calculate the required network bandwidth of a single GPU with the forward computation time. Using the variables in Table 5, we can derive the bandwidth requirements as the following. According to Korthikanti et al. (2022), the bytes of activation memory per each forward micro-batch are:

No recomputation: \( \frac{Lbh}{pt}(34 + \frac{5Hl}{h}) \) bytes
Table 6. MFU numbers of the GPT-3 96B model, when the tensor parallelism degree is 8.

<table>
<thead>
<tr>
<th>Model ID</th>
<th>Model Parallelism</th>
<th>mb</th>
<th>Recompute scope</th>
<th>Megatron-LM MFU [%]</th>
<th>BPIPE MFU [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPT-3</td>
<td>8</td>
<td>4</td>
<td>attention</td>
<td>36.18</td>
<td>35.80</td>
</tr>
<tr>
<td>96B</td>
<td></td>
<td></td>
<td>2 attention</td>
<td>36.56</td>
<td>36.52</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>4 layer</td>
<td>38.04</td>
<td>38.00</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>8 layer</td>
<td>24.31</td>
<td>27.47</td>
</tr>
</tbody>
</table>

Attention recomputation scope: \( 34 \frac{Llbh}{pt} \) bytes

Layer recomputation scope: \( 2 \frac{Llbh}{p} \) bytes

Dividing Eq (4) to (6) by \( f \), bandwidth requirements are derived as the following equations.

No recomputation:

\[
1 \frac{Llbh}{10^9 pt f} (34 + \frac{5Hl}{h}) \text{ GB/s} \tag{7}
\]

Attention recomputation scope:

\[
34 \frac{Llbh}{10^9 pt f} \text{ GB/s} \tag{8}
\]

Layer recomputation scope:

\[
2 \frac{Llbh}{10^9 pt f} \text{ GB/s} \tag{9}
\]

If a single node has 8 GPUs, two GPUs within the node where an evictor-acceptor pair resides should have larger bandwidth than Eq (7) to (9) when the tensor parallelism degree is 1, 2, or 4. For the tensor parallelism degree of 8, inter-node bandwidth divided by 8 should be larger than the derived bandwidths.

In our evaluation, the fastest configuration of BPIPE for the GPT-3 96B model uses 4-way tensor parallelism, 8-way pipeline parallelism, attention recomputation scope, and the micro-batch size of 2. Then the total bytes to transfer are 3.48 GB. Since each forward computation takes 143.37 ms in Table 3, the required bandwidth is 24.25 GB/s. As the forward computation time does not change by the recomputation scope, we can derive that NVLink is also sufficient for no recomputation in which the required bandwidth is 100.31 GB/s.

When the tensor parallelism degree is equal to the number of GPUs in a single node, BPIPE transfers activations across the node boundary. However, the inter-node transfer is feasible if (1) a cluster has a fast inter-node network such as InfiniBand and (2) recomputation is used for training. Moreover, we alleviate the bandwidth requirements with the following optimization. In general, backward computation takes twice as long as forward computation. Therefore, we allow the consecutive \textit{Evict-Load} to be performed over the timespan of consecutive \textit{BACKWARD-FORWARD} computation within the steady phase. For example, in Figure 4, evicting and loading the micro-batches whose indices are 3 and 1 can be completed within the sum of the time required for backward and forward computations. The bandwidth requirements are then relieved as the following equations.

No recomputation:

\[
1 \frac{2Llbh}{10^9 3pt f} (34 + \frac{5Hl}{h}) \text{ GB/s} \tag{10}
\]

Attention recomputation scope:

\[
\frac{68}{3} \frac{Llbh}{10^9 pt f} \text{ GB/s} \tag{11}
\]

Layer recomputation scope:

\[
\frac{4}{3} \frac{Llbh}{10^9 pt f} \text{ GB/s} \tag{12}
\]

Table 6 presents the MFU values when the tensor parallelism degree is 8 for the GPT-3 96B model. The results show that BPIPE is feasible even when the tensor parallelism degree is 8. Furthermore, when the micro-batch size is 8, the MFU value of BPIPE is higher than the MFU value of Megatron-LM. For such a large micro-batch size, we observed that PyTorch periodically vacates the cache allocator due to high memory pressure, resulting in performance degradation. BPIPE partially avoids such inefficiency with the activation balancing.
## C. Raw MFU Numbers

Table 7. Raw MFU numbers of Figure 7. The numbers are the values of Megatron-LM, except those annotated with **BPipe**. For those configurations, Megatron-LM fails to run due to the out-of-memory error.

<table>
<thead>
<tr>
<th>Model ID</th>
<th>Model Parallelism</th>
<th>mb</th>
<th>Recompute scope</th>
<th>MFU [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>layer</td>
<td></td>
</tr>
<tr>
<td>(1)</td>
<td>1 16</td>
<td>1</td>
<td>layer</td>
<td>38.08</td>
</tr>
<tr>
<td>(2)</td>
<td>2 8</td>
<td>1</td>
<td>layer</td>
<td>40.46</td>
</tr>
<tr>
<td>(3)</td>
<td>4 4</td>
<td>1</td>
<td>attention</td>
<td>37.59</td>
</tr>
<tr>
<td>(4)</td>
<td>8 2</td>
<td>1</td>
<td>attention</td>
<td>37.09</td>
</tr>
<tr>
<td>(5)</td>
<td>2 16</td>
<td>1</td>
<td>attention</td>
<td>48.78 <strong>(BPipe)</strong></td>
</tr>
<tr>
<td>(6)</td>
<td>4 8</td>
<td>1</td>
<td>attention</td>
<td>36.23</td>
</tr>
<tr>
<td>(7)</td>
<td>8 4</td>
<td>1</td>
<td>attention</td>
<td>36.84</td>
</tr>
</tbody>
</table>

GPT-3 96B

<table>
<thead>
<tr>
<th>Model ID</th>
<th>Model Parallelism</th>
<th>mb</th>
<th>Recompute scope</th>
<th>MFU [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>layer</td>
<td></td>
</tr>
<tr>
<td>(1)</td>
<td>2 12</td>
<td>1</td>
<td>layer</td>
<td>39.98</td>
</tr>
<tr>
<td>(2)</td>
<td>4 6</td>
<td>1</td>
<td>attention</td>
<td>39.90</td>
</tr>
<tr>
<td>(3)</td>
<td>8 3</td>
<td>1</td>
<td>attention</td>
<td>37.77</td>
</tr>
<tr>
<td>(4)</td>
<td>4 12</td>
<td>1</td>
<td>attention</td>
<td>39.19 <strong>(BPipe)</strong></td>
</tr>
<tr>
<td>(5)</td>
<td>8 6</td>
<td>1</td>
<td>attention</td>
<td>38.45</td>
</tr>
</tbody>
</table>

GPT-3 134B

<table>
<thead>
<tr>
<th>Model ID</th>
<th>Model Parallelism</th>
<th>mb</th>
<th>Recompute scope</th>
<th>MFU [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>layer</td>
<td></td>
</tr>
<tr>
<td>(1)</td>
<td>2 8</td>
<td>1</td>
<td>layer</td>
<td>38.64</td>
</tr>
<tr>
<td>(2)</td>
<td>4 12</td>
<td>1</td>
<td>attention</td>
<td>38.81</td>
</tr>
<tr>
<td>(3)</td>
<td>8 3</td>
<td>1</td>
<td>attention</td>
<td>38.19</td>
</tr>
<tr>
<td>(4)</td>
<td>4 12</td>
<td>1</td>
<td>attention</td>
<td>38.45</td>
</tr>
<tr>
<td>(5)</td>
<td>8 6</td>
<td>1</td>
<td>attention</td>
<td>38.72</td>
</tr>
</tbody>
</table>

**52.06 (BPipe)**