MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11905-11917, 2024.

Abstract

Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use cases like chat, programming, and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to the varying popularity of LLMs. In this paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind MuxServe is to colocate LLMs considering their popularity to multiplex memory resources, and to leverage the characteristics of the prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe achieves up to $1.8\times$ higher throughput or processes $2.9\times$ more requests within $99\%$ SLO attainment. The code is available at: https://github.com/hao-ai-lab/MuxServe.
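
To make the popularity-aware colocation idea concrete, the following is a minimal, hypothetical Python sketch of placing LLMs onto GPUs by request rate. It is not MuxServe's actual placement algorithm (which jointly optimizes placement with adaptive batch scheduling); the model names, memory figures, and the greedy_colocate helper are illustrative assumptions only.

# Hypothetical illustration of popularity-aware colocation, inspired by the
# abstract's key insight; NOT the paper's actual placement algorithm.
from dataclasses import dataclass

@dataclass
class LLM:
    name: str
    mem_gb: float    # weights + KV cache working set (assumed figure)
    req_rate: float  # popularity: requests per second (assumed figure)

def greedy_colocate(llms, num_gpus, gpu_mem_gb=80.0):
    """Greedily pack LLMs onto GPUs in descending order of popularity.

    Popular models are placed first so they land on the least-loaded
    GPUs; unpopular models are colocated afterwards to multiplex the
    remaining memory.
    """
    gpus = [{"mem_free": gpu_mem_gb, "load": 0.0, "models": []}
            for _ in range(num_gpus)]
    for llm in sorted(llms, key=lambda m: m.req_rate, reverse=True):
        # Among GPUs with enough free memory, pick the least loaded one.
        candidates = [g for g in gpus if g["mem_free"] >= llm.mem_gb]
        if not candidates:
            raise RuntimeError(f"cannot place {llm.name}")
        target = min(candidates, key=lambda g: g["load"])
        target["mem_free"] -= llm.mem_gb
        target["load"] += llm.req_rate
        target["models"].append(llm.name)
    return gpus

if __name__ == "__main__":
    llms = [
        LLM("llama-13b", mem_gb=30, req_rate=8.0),  # popular endpoint
        LLM("llama-7b",  mem_gb=16, req_rate=2.0),
        LLM("opt-6.7b",  mem_gb=15, req_rate=0.5),  # rarely requested
        LLM("opt-1.3b",  mem_gb=4,  req_rate=0.2),
    ]
    for i, g in enumerate(greedy_colocate(llms, num_gpus=2)):
        print(f"GPU{i}: {g['models']} (load={g['load']} req/s)")

Running this sketch colocates the low-traffic models on one GPU while the most popular model keeps the other largely to itself, which is the memory-multiplexing intuition the abstract describes; the paper additionally multiplexes computation by separating prefill and decoding phases within such colocations.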

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-duan24a,
  title = {{M}ux{S}erve: Flexible Spatial-Temporal Multiplexing for Multiple {LLM} Serving},
  author = {Duan, Jiangfei and Lu, Runyu and Duanmu, Haojie and Li, Xiuhong and Zhang, Xingcheng and Lin, Dahua and Stoica, Ion and Zhang, Hao},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {11905--11917},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/duan24a/duan24a.pdf},
  url = {https://proceedings.mlr.press/v235/duan24a.html},
  abstract = {Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use cases like chat, programming, and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to the varying popularity of LLMs. In this paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind MuxServe is to colocate LLMs considering their popularity to multiplex memory resources, and to leverage the characteristics of the prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe achieves up to $1.8\times$ higher throughput or processes $2.9\times$ more requests within $99\%$ SLO attainment. The code is available at: https://github.com/hao-ai-lab/MuxServe.}
}
Endnote
%0 Conference Paper
%T MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
%A Jiangfei Duan
%A Runyu Lu
%A Haojie Duanmu
%A Xiuhong Li
%A Xingcheng Zhang
%A Dahua Lin
%A Ion Stoica
%A Hao Zhang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-duan24a
%I PMLR
%P 11905--11917
%U https://proceedings.mlr.press/v235/duan24a.html
%V 235
%X Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use cases like chat, programming, and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to the varying popularity of LLMs. In this paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind MuxServe is to colocate LLMs considering their popularity to multiplex memory resources, and to leverage the characteristics of the prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe achieves up to $1.8\times$ higher throughput or processes $2.9\times$ more requests within $99\%$ SLO attainment. The code is available at: https://github.com/hao-ai-lab/MuxServe.
APA
Duan, J., Lu, R., Duanmu, H., Li, X., Zhang, X., Lin, D., Stoica, I. & Zhang, H. (2024). MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:11905-11917. Available from https://proceedings.mlr.press/v235/duan24a.html.
