Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8145-8171, 2025.

Abstract

Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that enable better learning and generation in diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) that leverages mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and that a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76$\times$ faster training and 31$\times$ higher inference throughput for 512$\times$512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models will be released.
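To make the abstract's core idea concrete, below is a minimal PyTorch sketch of the kind of objective it describes: a plain (non-variational) autoencoder whose encoder sees partially masked patch embeddings and whose decoder reconstructs them, producing latent tokens that a downstream diffusion model would consume. All module names, dimensions, and the mask ratio are illustrative assumptions, not the authors' actual MAETok architecture or hyperparameters.

# Illustrative sketch only: an AE trained with MAE-style mask modeling,
# in the spirit of the abstract. Sizes and mask_ratio are assumptions,
# not MAETok's actual design.
import torch
import torch.nn as nn

class MaskedAETokenizer(nn.Module):
    def __init__(self, patch_dim=768, latent_dim=16, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        make_layer = lambda: nn.TransformerEncoderLayer(
            patch_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.decoder = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.to_latent = nn.Linear(patch_dim, latent_dim)    # compact latent tokens
        self.from_latent = nn.Linear(latent_dim, patch_dim)

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened image patch embeddings.
        B, N, _ = patches.shape
        # Replace a random subset of patches with a learnable mask token.
        mask = torch.rand(B, N, device=patches.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand(B, N, -1), patches)
        z = self.to_latent(self.encoder(x))      # latents for the diffusion model
        recon = self.decoder(self.from_latent(z))
        # Reconstruction loss on masked positions only; no KL term, so this
        # is a plain AE rather than a VAE, matching the abstract's claim.
        loss = ((recon - patches) ** 2)[mask].mean()
        return z, loss

model = MaskedAETokenizer()
latents, loss = model(torch.randn(2, 256, 768))  # dummy batch of patch embeddings
print(latents.shape, loss.item())                # torch.Size([2, 256, 16])

In the paper's framing, the gain comes from the structure that mask modeling induces in the latents, not from any variational constraint; this sketch only illustrates where such a latent space would come from, not the paper's specific encoder/decoder design.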

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25v,
  title     = {Masked Autoencoders Are Effective Tokenizers for Diffusion Models},
  author    = {Chen, Hao and Han, Yujin and Chen, Fangyi and Li, Xiang and Wang, Yidong and Wang, Jindong and Wang, Ze and Liu, Zicheng and Zou, Difan and Raj, Bhiksha},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {8145--8171},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25v/chen25v.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25v.html}
}
Endnote
%0 Conference Paper
%T Masked Autoencoders Are Effective Tokenizers for Diffusion Models
%A Hao Chen
%A Yujin Han
%A Fangyi Chen
%A Xiang Li
%A Yidong Wang
%A Jindong Wang
%A Ze Wang
%A Zicheng Liu
%A Difan Zou
%A Bhiksha Raj
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25v
%I PMLR
%P 8145--8171
%U https://proceedings.mlr.press/v267/chen25v.html
%V 267
APA
Chen, H., Han, Y., Chen, F., Li, X., Wang, Y., Wang, J., Wang, Z., Liu, Z., Zou, D., & Raj, B. (2025). Masked Autoencoders Are Effective Tokenizers for Diffusion Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8145-8171. Available from https://proceedings.mlr.press/v267/chen25v.html.