Diff-MoE: Diffusion Transformer with Time-Aware and Space-Adaptive Experts

Kun Cheng, Xiao He, Lei Yu, Zhijun Tu, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10010-10024, 2025.

Abstract

Diffusion models have transformed generative modeling but suffer from scalability limitations due to computational overhead and inflexible architectures that process all generative stages and tokens uniformly. In this work, we introduce Diff-MoE, a novel framework that combines Diffusion Transformers with Mixture-of-Experts to exploit both temporal adaptability and spatial flexibility. Our design incorporates expert-specific timestep conditioning, allowing each expert to process different spatial tokens while adapting to the generative stage, so that resources are allocated dynamically based on both the temporal and spatial characteristics of the generative task. Additionally, we propose a globally-aware feature recalibration mechanism that amplifies the representational capacity of expert modules by dynamically adjusting feature contributions based on input relevance. Extensive experiments on image generation benchmarks demonstrate that Diff-MoE significantly outperforms state-of-the-art methods. Our work demonstrates the potential of integrating diffusion models with expert-based designs, offering a scalable and effective framework for advanced generative modeling.
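
To make the abstract's architecture concrete, below is a minimal, hypothetical sketch of a Mixture-of-Experts feed-forward layer in which each spatial token is routed to an expert, each expert is modulated by the timestep embedding (expert-specific timestep conditioning), and the expert output is gated by a pooled global summary (a stand-in for the globally-aware recalibration). All names (`DiffMoELayer`, `TimestepExpert`, top-1 routing, mean-pooled gating) and design details are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's code. Assumes PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimestepExpert(nn.Module):
    """One expert MLP whose hidden activations are modulated by the timestep embedding."""

    def __init__(self, dim: int, hidden: int, t_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        # Expert-specific timestep conditioning: per-expert scale/shift from t_emb.
        self.t_proj = nn.Linear(t_dim, 2 * hidden)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.t_proj(t_emb).chunk(2, dim=-1)
        h = F.gelu(self.fc1(x))
        h = h * (1 + scale) + shift  # adapt the expert to the generative stage
        return self.fc2(h)


class DiffMoELayer(nn.Module):
    """Routes each spatial token to its top-1 expert; experts are timestep-conditioned."""

    def __init__(self, dim: int, hidden: int, t_dim: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [TimestepExpert(dim, hidden, t_dim) for _ in range(n_experts)]
        )
        # Global gating (assumption): weigh each token's expert output by its
        # relevance to a mean-pooled summary of the whole token sequence.
        self.recal = nn.Linear(2 * dim, 1)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) spatial tokens; t_emb: (B, t_dim) timestep embedding
        probs = self.router(x).softmax(dim=-1)       # (B, N, E) routing weights
        top_p, top_i = probs.max(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e                         # tokens assigned to expert e
            if mask.any():
                b_idx = mask.nonzero(as_tuple=True)[0]
                out[mask] = expert(x[mask], t_emb[b_idx])
        out = out * top_p.unsqueeze(-1)               # weight by router confidence
        g = x.mean(dim=1, keepdim=True).expand_as(x)  # global context per token
        gate = torch.sigmoid(self.recal(torch.cat([out, g], dim=-1)))
        return x + gate * out                         # recalibrated residual update


# Example usage with arbitrary sizes:
layer = DiffMoELayer(dim=384, hidden=1536, t_dim=384)
x = torch.randn(2, 256, 384)    # 2 samples, 256 spatial tokens
t_emb = torch.randn(2, 384)     # timestep embedding per sample
y = layer(x, t_emb)             # (2, 256, 384)
```

A layer of this shape could replace the standard MLP inside a DiT block; the actual routing rule, conditioning form, and recalibration mechanism in Diff-MoE may differ.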

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-cheng25d,
  title     = {Diff-{M}o{E}: Diffusion Transformer with Time-Aware and Space-Adaptive Experts},
  author    = {Cheng, Kun and He, Xiao and Yu, Lei and Tu, Zhijun and Zhu, Mingrui and Wang, Nannan and Gao, Xinbo and Hu, Jie},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {10010--10024},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/cheng25d/cheng25d.pdf},
  url       = {https://proceedings.mlr.press/v267/cheng25d.html},
  abstract  = {Diffusion models have transformed generative modeling but suffer from scalability limitations due to computational overhead and inflexible architectures that process all generative stages and tokens uniformly. In this work, we introduce Diff-MoE, a novel framework that combines Diffusion Transformers with Mixture-of-Experts to exploit both temporal adaptability and spatial flexibility. Our design incorporates expert-specific timestep conditioning, allowing each expert to process different spatial tokens while adapting to the generative stage, so that resources are allocated dynamically based on both the temporal and spatial characteristics of the generative task. Additionally, we propose a globally-aware feature recalibration mechanism that amplifies the representational capacity of expert modules by dynamically adjusting feature contributions based on input relevance. Extensive experiments on image generation benchmarks demonstrate that Diff-MoE significantly outperforms state-of-the-art methods. Our work demonstrates the potential of integrating diffusion models with expert-based designs, offering a scalable and effective framework for advanced generative modeling.}
}
Endnote
%0 Conference Paper
%T Diff-MoE: Diffusion Transformer with Time-Aware and Space-Adaptive Experts
%A Kun Cheng
%A Xiao He
%A Lei Yu
%A Zhijun Tu
%A Mingrui Zhu
%A Nannan Wang
%A Xinbo Gao
%A Jie Hu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-cheng25d
%I PMLR
%P 10010--10024
%U https://proceedings.mlr.press/v267/cheng25d.html
%V 267
%X Diffusion models have transformed generative modeling but suffer from scalability limitations due to computational overhead and inflexible architectures that process all generative stages and tokens uniformly. In this work, we introduce Diff-MoE, a novel framework that combines Diffusion Transformers with Mixture-of-Experts to exploit both temporal adaptability and spatial flexibility. Our design incorporates expert-specific timestep conditioning, allowing each expert to process different spatial tokens while adapting to the generative stage, so that resources are allocated dynamically based on both the temporal and spatial characteristics of the generative task. Additionally, we propose a globally-aware feature recalibration mechanism that amplifies the representational capacity of expert modules by dynamically adjusting feature contributions based on input relevance. Extensive experiments on image generation benchmarks demonstrate that Diff-MoE significantly outperforms state-of-the-art methods. Our work demonstrates the potential of integrating diffusion models with expert-based designs, offering a scalable and effective framework for advanced generative modeling.
APA
Cheng, K., He, X., Yu, L., Tu, Z., Zhu, M., Wang, N., Gao, X. & Hu, J. (2025). Diff-MoE: Diffusion Transformer with Time-Aware and Space-Adaptive Experts. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:10010-10024. Available from https://proceedings.mlr.press/v267/cheng25d.html.
