Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:16956-16976, 2025.

Abstract

Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by 1.9$\times$.
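The "W3A6" setting refers to 3-bit weights and 6-bit activations. As a rough illustration of why such low bit-widths lose information, here is a minimal, hypothetical sketch of plain uniform (min-max) quantization in pure Python; it is not the paper's TQE/TMD method, only a baseline showing how error grows as the bit-width shrinks:

```python
import random

def uniform_quantize(ws, bits):
    """Quantize a list of floats to 2**bits uniform levels, then de-quantize.

    Returns the reconstructed values so the rounding error is visible.
    """
    levels = 2 ** bits - 1
    lo, hi = min(ws), max(ws)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Round each value to its nearest integer code, then map back to floats.
    return [round((w - lo) / scale) * scale + lo for w in ws]

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in weight tensor

# Mean absolute reconstruction error at 3 bits (the "W3" regime) vs. 8 bits.
err3 = sum(abs(a - b) for a, b in zip(uniform_quantize(w, 3), w)) / len(w)
err8 = sum(abs(a - b) for a, b in zip(uniform_quantize(w, 8), w)) / len(w)
```

With only 8 levels available at 3 bits, the reconstruction error is over an order of magnitude larger than at 8 bits, which is the gap that compensation schemes such as the paper's TQE aim to close.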

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-feng25q,
  title     = {Q-{VD}i{T}: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers},
  author    = {Feng, Weilun and Yang, Chuanguang and Qin, Haotong and Li, Xiangqi and Wang, Yu and An, Zhulin and Huang, Libo and Diao, Boyu and Zhao, Zixiang and Xu, Yongjun and Magno, Michele},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {16956--16976},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/feng25q/feng25q.pdf},
  url       = {https://proceedings.mlr.press/v267/feng25q.html},
  abstract  = {Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by 1.9$\times$.}
}
Endnote
%0 Conference Paper
%T Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers
%A Weilun Feng
%A Chuanguang Yang
%A Haotong Qin
%A Xiangqi Li
%A Yu Wang
%A Zhulin An
%A Libo Huang
%A Boyu Diao
%A Zixiang Zhao
%A Yongjun Xu
%A Michele Magno
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-feng25q
%I PMLR
%P 16956--16976
%U https://proceedings.mlr.press/v267/feng25q.html
%V 267
%X Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by 1.9$\times$.
APA
Feng, W., Yang, C., Qin, H., Li, X., Wang, Y., An, Z., Huang, L., Diao, B., Zhao, Z., Xu, Y. & Magno, M. (2025). Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:16956-16976. Available from https://proceedings.mlr.press/v267/feng25q.html.