Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:24342-24353, 2024.
Abstract
Mixture-of-Experts (MoE) is a powerful technique for enhancing the performance of neural networks while decoupling computational complexity from the number of parameters. However, scaling the number of experts still requires adding more GPUs. In addition, load imbalance in the number of tokens routed to each expert causes unnecessary computation or straggler problems. We present ES-MoE, a novel method for efficiently scaling MoE training. It offloads expert parameters to host memory and leverages pipelined expert processing to overlap GPU-CPU communication with GPU computation. It also dynamically balances token loads across GPUs, improving computational efficiency. ES-MoE accelerates MoE training on a limited number of GPUs without degrading model performance. We validate our approach on GPT-based MoE models, demonstrating 67$\times$ better scalability and up to 17.5$\times$ higher throughput than existing frameworks.
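To make the core idea of pipelined expert offloading concrete, the following is a minimal sketch (not the authors' implementation) of overlapping host-to-GPU expert-weight transfers with per-expert computation using two CUDA streams in PyTorch. All names here (`pipelined_expert_forward`, `expert_weights_cpu`, `hidden`, etc.) are illustrative assumptions, and the "expert" is reduced to a single matrix multiply.

```python
# Sketch: prefetch the next expert's weights from pinned host memory on a
# side stream while the current expert's computation runs on the default
# stream, so copy and compute overlap. Illustrative only.
import torch


def pipelined_expert_forward(tokens, expert_weights_cpu):
    """Process tokens expert-by-expert, prefetching each expert's weights
    from pinned host memory so the H2D copy overlaps with computation."""
    assert torch.cuda.is_available(), "this sketch requires a CUDA device"
    device = torch.device("cuda")
    compute_stream = torch.cuda.current_stream(device)
    copy_stream = torch.cuda.Stream(device)

    num_experts = len(expert_weights_cpu)
    outputs = []

    # Prefetch the first expert's weights on the copy stream.
    with torch.cuda.stream(copy_stream):
        next_w = expert_weights_cpu[0].to(device, non_blocking=True)

    for e in range(num_experts):
        # Make sure the prefetched weights for expert `e` have arrived.
        compute_stream.wait_stream(copy_stream)
        w = next_w
        w.record_stream(compute_stream)  # weights are consumed on compute_stream

        # Kick off the next expert's weight copy while we compute.
        if e + 1 < num_experts:
            with torch.cuda.stream(copy_stream):
                next_w = expert_weights_cpu[e + 1].to(device, non_blocking=True)

        # Stand-in for the expert computation (a single matmul here).
        outputs.append(tokens[e] @ w)

    return outputs


if __name__ == "__main__":
    hidden, tokens_per_expert, num_experts = 1024, 256, 8
    # Expert weights live in pinned host memory so copies can be asynchronous.
    expert_weights_cpu = [
        torch.randn(hidden, hidden).pin_memory() for _ in range(num_experts)
    ]
    tokens = [
        torch.randn(tokens_per_expert, hidden, device="cuda")
        for _ in range(num_experts)
    ]
    outs = pipelined_expert_forward(tokens, expert_weights_cpu)
    torch.cuda.synchronize()
    print(len(outs), outs[0].shape)
```

In this toy setup, keeping the weights pinned on the host and issuing the copy for expert $e{+}1$ before computing expert $e$ is what allows the GPU-CPU transfer to hide behind GPU computation, which is the overlap the abstract describes; the actual ES-MoE system additionally handles optimizer state, backward passes, and dynamic token-load balancing, none of which is shown here.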