QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

Hamidreza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek El-Ghazawi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:26433-26445, 2025.

Abstract

The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality while maintaining throughput comparable to serving a single model, and incurs only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA’s multi-instance GPU (MIG). Furthermore, experiments on Google’s Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.
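The abstract names two mechanisms, similarity-based expert consolidation and runtime partial reconfiguration. Since this page reproduces only the abstract, the Python sketch below is a hypothetical illustration of those two ideas rather than the paper's implementation: the cosine-similarity matching, the 0.95 sharing threshold, and all names (consolidate_experts, PartiallyReconfigurableServer) are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def consolidate_experts(models: dict, threshold: float = 0.95):
    """Greedily share experts across models whose weights are similar.

    models: {model_name: {(layer, expert_id): weight ndarray}}
    Returns (assignment, pool): assignment maps each (model, layer, expert)
    to a key into `pool`, which holds one resident copy per shared expert.
    """
    pool, assignment = {}, {}
    for model_name, experts in models.items():
        for (layer, eid), w in experts.items():
            # Reuse an already-resident expert from the same layer if one is similar enough.
            match = next(
                (k for k, shared in pool.items()
                 if k[0] == layer and cosine_similarity(w, shared) >= threshold),
                None,
            )
            if match is None:                 # no similar expert resident yet: add a new slot
                match = (layer, len(pool))
                pool[match] = w
            assignment[(model_name, layer, eid)] = match
    return assignment, pool

class PartiallyReconfigurableServer:
    """Consolidated experts stay resident on the GPU; only the small per-model
    non-expert state (attention, router, norms) is swapped when the target model changes."""

    def __init__(self, pool, non_expert_state):
        self.pool = pool                          # shared expert weights
        self.non_expert_state = non_expert_state  # {model_name: non-expert weights}
        self.active = None

    def handle(self, model_name, prompt):
        if model_name != self.active:
            # Runtime partial reconfiguration: reload only the non-expert layers
            # for `model_name`; expert weights are left untouched.
            self.state = self.non_expert_state[model_name]
            self.active = model_name
        return f"[{self.active}] generated text for: {prompt}"  # placeholder generation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = {(l, e): rng.standard_normal((4, 4)) for l in range(2) for e in range(4)}
    # A fine-tuned variant whose experts differ only slightly from the base model.
    variant = {k: v + 0.01 * rng.standard_normal(v.shape) for k, v in base.items()}
    assignment, pool = consolidate_experts({"model_a": base, "model_b": variant})
    print(f"experts resident: {len(pool)} (instead of {2 * len(base)})")

    server = PartiallyReconfigurableServer(pool, {"model_a": {}, "model_b": {}})
    print(server.handle("model_b", "hello"))
```

In this toy run the two variants share every expert, so only one copy per expert stays resident, and a request that switches models reloads only the small non-expert state, which is consistent with the low TTFT overhead claimed in the abstract.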

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-imani25a,
  title     = {{Q}o{S}-Efficient Serving of Multiple Mixture-of-Expert {LLM}s Using Partial Runtime Reconfiguration},
  author    = {Imani, Hamidreza and Peng, Jiaxin and Mohseni, Peiman and Amirany, Abdolah and El-Ghazawi, Tarek},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {26433--26445},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/imani25a/imani25a.pdf},
  url       = {https://proceedings.mlr.press/v267/imani25a.html},
  abstract  = {The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality while maintaining throughput comparable to serving a single model, and incurs only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA’s multi-instance GPU (MIG). Furthermore, experiments on Google’s Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.}
}
Endnote
%0 Conference Paper
%T QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
%A Hamidreza Imani
%A Jiaxin Peng
%A Peiman Mohseni
%A Abdolah Amirany
%A Tarek El-Ghazawi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-imani25a
%I PMLR
%P 26433--26445
%U https://proceedings.mlr.press/v267/imani25a.html
%V 267
%X The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality while maintaining throughput comparable to serving a single model, and incurs only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA’s multi-instance GPU (MIG). Furthermore, experiments on Google’s Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.
APA
Imani, H., Peng, J., Mohseni, P., Amirany, A. & El-Ghazawi, T. (2025). QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:26433-26445. Available from https://proceedings.mlr.press/v267/imani25a.html.
