Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:73671-73682, 2025.

Abstract

Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and selecting the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow-layer learning, and a router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains along with promising scaling properties.
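
The abstract describes Expert Race routing only at a high level: rather than each token independently picking its own top-k experts, all token-expert affinities compete in a single pool and the globally top-scoring assignments win. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the tensor shapes, the softmax-normalized router, and the names expert_race_route, top_k, and budget are assumptions made for illustration.

    # Sketch of joint token-expert "race" routing (illustrative only).
    import torch

    def expert_race_route(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
        """Select token-expert assignments via a single global competition.

        router_logits: (num_tokens, num_experts) affinity scores.
        top_k: average number of experts activated per token.
        Returns a (num_tokens, num_experts) gating matrix, zero where the
        assignment lost the race.
        """
        num_tokens, num_experts = router_logits.shape
        scores = router_logits.softmax(dim=-1)

        # Flatten so every (token, expert) pair competes in one pool,
        # instead of each token independently keeping its own top-k experts.
        flat = scores.flatten()
        budget = top_k * num_tokens  # total activations kept across all tokens
        kept = torch.topk(flat, budget).indices

        gates = torch.zeros_like(flat)
        gates[kept] = flat[kept]
        return gates.view(num_tokens, num_experts)

    # Usage: 8 tokens routed over 4 experts, averaging 2 active experts per token.
    logits = torch.randn(8, 4)
    gates = expert_race_route(logits, top_k=2)
    print((gates > 0).sum(dim=-1))  # per-token expert counts can vary under the race

Under this scheme the per-token expert count is flexible: tokens with strong affinities can win more expert slots while others win fewer, which is the "dynamic assignment to critical tokens" the abstract refers to.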

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yuan25j,
  title     = {Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts},
  author    = {Yuan, Yike and Wang, Ziyu and Huang, Zihao and Zhu, Defa and Zhou, Xun and Yu, Jingyi and Min, Qiyang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {73671--73682},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yuan25j/yuan25j.pdf},
  url       = {https://proceedings.mlr.press/v267/yuan25j.html},
  abstract  = {Diffusion models have emerged as mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains while promising scaling properties.}
}
Endnote
%0 Conference Paper
%T Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
%A Yike Yuan
%A Ziyu Wang
%A Zihao Huang
%A Defa Zhu
%A Xun Zhou
%A Jingyi Yu
%A Qiyang Min
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yuan25j
%I PMLR
%P 73671--73682
%U https://proceedings.mlr.press/v267/yuan25j.html
%V 267
%X Diffusion models have emerged as mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains while promising scaling properties.
APA
Yuan, Y., Wang, Z., Huang, Z., Zhu, D., Zhou, X., Yu, J. & Min, Q. (2025). Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:73671-73682. Available from https://proceedings.mlr.press/v267/yuan25j.html.