Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:68208-68224, 2025.

Abstract

Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D full attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D full attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups according to their distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy that captures the dynamic sparse patterns and predicts the type of each attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28$\times$ and 2.33$\times$ end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code will be open-sourced upon publication.
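The spatial/temporal head classification described above is concrete enough to sketch. Below is a minimal, illustrative Python (PyTorch) sketch, assuming a frame-major token layout (F frames of S tokens each, flattened to L = F * S tokens). All names here (spatial_mask, temporal_mask, classify_head, num_probe) are our own illustrative choices, not the paper's API, and dense boolean masks are used only for clarity; the actual system uses hardware-efficient sparse kernels.

import torch

def spatial_mask(F, S, device=None):
    # True where query and key fall in the same frame (block-diagonal).
    frame_id = torch.arange(F * S, device=device) // S
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(F, S, device=None):
    # True where query and key share a spatial location across frames
    # (a strided pattern in the flattened frame-major layout).
    pos_id = torch.arange(F * S, device=device) % S
    return pos_id[:, None] == pos_id[None, :]

def masked_attention(q, k, v, mask):
    # Dense reference implementation; the paper uses custom sparse kernels.
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def classify_head(q, k, v, F, S, num_probe=16):
    # Online profiling (our reading of the abstract): on a few sampled query
    # rows, compare each sparse pattern's output against full attention and
    # label the head with whichever pattern is closer.
    L = F * S
    idx = torch.randperm(L)[:num_probe]
    qp = q[idx]
    full = torch.ones(num_probe, L, dtype=torch.bool, device=q.device)
    ref = masked_attention(qp, k, v, full)
    errs = {}
    for name, mask_fn in (("spatial", spatial_mask), ("temporal", temporal_mask)):
        m = mask_fn(F, S, device=q.device)[idx]  # probe rows of the sparse mask
        errs[name] = (masked_attention(qp, k, v, m) - ref).pow(2).mean().item()
    return min(errs, key=errs.get)

# Toy usage: one attention head over a 4-frame, 64-tokens-per-frame sequence.
F, S, d = 4, 64, 32
q, k, v = (torch.randn(F * S, d) for _ in range(3))
print(classify_head(q, k, v, F, S))  # -> "spatial" or "temporal"

Note that in this frame-major layout the temporal pattern is strided and non-contiguous. The hardware-efficient tensor layout transformation mentioned in the abstract can be read as permuting tokens toward a location-major order so that temporal heads also attend to contiguous blocks, which is what the customized kernels exploit; the precise transformation is defined in the paper itself.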

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-xi25c,
  title     = {Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity},
  author    = {Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Xu, Chenfeng and Li, Muyang and Li, Xiuyu and Lin, Yujun and Cai, Han and Zhang, Jintao and Li, Dacheng and Chen, Jianfei and Stoica, Ion and Keutzer, Kurt and Han, Song},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {68208--68224},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/xi25c/xi25c.pdf},
  url       = {https://proceedings.mlr.press/v267/xi25c.html},
  abstract  = {Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D full attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D full attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups according to their distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy that captures the dynamic sparse patterns and predicts the type of each attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28$\times$ and 2.33$\times$ end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code will be open-sourced upon publication.}
}
Endnote
%0 Conference Paper
%T Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
%A Haocheng Xi
%A Shuo Yang
%A Yilong Zhao
%A Chenfeng Xu
%A Muyang Li
%A Xiuyu Li
%A Yujun Lin
%A Han Cai
%A Jintao Zhang
%A Dacheng Li
%A Jianfei Chen
%A Ion Stoica
%A Kurt Keutzer
%A Song Han
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-xi25c
%I PMLR
%P 68208--68224
%U https://proceedings.mlr.press/v267/xi25c.html
%V 267
%X Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D full attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D full attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups according to their distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy that captures the dynamic sparse patterns and predicts the type of each attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28$\times$ and 2.33$\times$ end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code will be open-sourced upon publication.
APA
Xi, H., Yang, S., Zhao, Y., Xu, C., Li, M., Li, X., Lin, Y., Cai, H., Zhang, J., Li, D., Chen, J., Stoica, I., Keutzer, K. & Han, S. (2025). Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:68208-68224. Available from https://proceedings.mlr.press/v267/xi25c.html.
