HexGen: Generative Inference of Large Language Model over Heterogeneous Environment

Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:21946-21961, 2024.

Abstract

Serving generative inference for large language models is a crucial component of contemporary AI applications. In this paper, we focus on deploying such services in a heterogeneous, cross-datacenter setting to mitigate the substantial inference costs typically associated with a single centralized datacenter. Toward this end, we propose HexGen, a flexible distributed inference engine that uniquely supports asymmetric partitioning of generative inference computation over both tensor model parallelism and pipeline parallelism, enabling effective deployment across diverse GPUs interconnected by a fully heterogeneous network. We further propose a scheduling algorithm grounded in constrained optimization that adaptively assigns asymmetric inference computation across the GPUs to fulfill inference requests while maintaining acceptable latency levels. We conduct an extensive empirical study of HexGen's efficiency by serving the state-of-the-art Llama-2 (70B) model. The experimental results suggest that, given the same budget, HexGen can achieve up to $2.3\times$ lower latency deadlines or tolerate up to $4\times$ higher request rates than the homogeneous baseline.
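To make the abstract's two ideas concrete, the minimal Python sketch below encodes an "asymmetric" plan, where each pipeline stage may have its own tensor-parallel degree, GPU type, and layer share, and brute-forces the lowest-latency feasible plan as a stand-in for the paper's constrained-optimization scheduler. This is not HexGen's actual API or cost model; the `Stage` class, GPU specs, timing numbers, and search below are all illustrative assumptions.

```python
# Minimal illustrative sketch -- NOT HexGen's actual API or scheduler.
# Each pipeline stage carries its own tensor-parallel degree, GPU type,
# and layer share; we brute-force the lowest-latency feasible plan.
# All GPU specs, cost numbers, and names are illustrative assumptions.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Stage:
    gpu_type: str          # hypothetical device label
    tp_degree: int         # tensor-parallel width inside this stage
    num_layers: int        # transformer layers assigned to this stage
    mem_per_gpu_gb: float
    secs_per_layer: float  # assumed per-layer step time on one GPU

TOTAL_LAYERS = 80   # Llama-2 (70B) has 80 transformer blocks
LAYER_GB = 1.75     # ~140 GB of fp16 weights / 80 layers (rough figure)

def feasible(plan):
    """Constraints: the stages must cover all layers, and each GPU must
    hold its tensor-parallel shard of its stage's weights."""
    covers = sum(s.num_layers for s in plan) == TOTAL_LAYERS
    fits = all(s.num_layers * LAYER_GB / s.tp_degree <= s.mem_per_gpu_gb
               for s in plan)
    return covers and fits

def latency(plan):
    """Toy objective: pipeline latency ~ sum of per-stage compute, with
    an idealized linear speedup from tensor parallelism (communication
    costs, which the paper's real cost model accounts for, are omitted)."""
    return sum(s.num_layers * s.secs_per_layer / s.tp_degree for s in plan)

def candidate_plans():
    # Pool A: two fast 40 GB GPUs (so tp_a <= 2);
    # pool B: four slower 24 GB GPUs (so tp_b <= 4).
    for split, tp_a, tp_b in product(range(0, 88, 8), (1, 2), (1, 2, 4)):
        yield (Stage("fast-40G", tp_a, split, 40.0, 0.002),
               Stage("slow-24G", tp_b, TOTAL_LAYERS - split, 24.0, 0.006))

best = min((p for p in candidate_plans() if feasible(p)), key=latency)
for s in best:
    print(f"{s.gpu_type}: tp={s.tp_degree}, layers={s.num_layers}")
```

In this toy instance neither pool can hold all 80 layers on its own, so every feasible plan is asymmetric; the search settles on a 40/40 layer split with a different tensor-parallel width per stage, which is the kind of heterogeneous placement HexGen's scheduler explores at much larger scale.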

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-jiang24f,
  title     = {{H}ex{G}en: Generative Inference of Large Language Model over Heterogeneous Environment},
  author    = {Jiang, Youhe and Yan, Ran and Yao, Xiaozhe and Zhou, Yang and Chen, Beidi and Yuan, Binhang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {21946--21961},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/jiang24f/jiang24f.pdf},
  url       = {https://proceedings.mlr.press/v235/jiang24f.html}
}
Endnote
%0 Conference Paper
%T HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
%A Youhe Jiang
%A Ran Yan
%A Xiaozhe Yao
%A Yang Zhou
%A Beidi Chen
%A Binhang Yuan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-jiang24f
%I PMLR
%P 21946--21961
%U https://proceedings.mlr.press/v235/jiang24f.html
%V 235
APA
Jiang, Y., Yan, R., Yao, X., Zhou, Y., Chen, B., & Yuan, B. (2024). HexGen: Generative Inference of Large Language Model over Heterogeneous Environment. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:21946-21961. Available from https://proceedings.mlr.press/v235/jiang24f.html.
