Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion

Yangfan He, Sida Li, Jianhui Wang, Xinyuan Song, Kun Li, Xinhang Yuan, Kuan Lu, Menghao Huo, Jingqun Tang, Yi Xin, Jiaqi Chen, Keqin Li, Miao Zhang, Xueqian Wang
Conference on Parsimony and Learning, PMLR 328:131-163, 2026.

Abstract

Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal, spatial, and semantic consistency with bilateral Denoising Diffusion Implicit Model (DDIM) inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions using temporally aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; (3) a Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment through a combination of shared prompt tokens and frame-specific tokens. Extensive experiments on multiple datasets demonstrate that our method significantly improves perceptual quality, text-image relevance, and temporal coherence. The proposed approach offers a practical and efficient solution for text-to-video (T2V) editing. Our code is available in the supplementary materials.
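
The abstract outlines three mechanisms without reproducing an implementation. As a minimal, hypothetical PyTorch sketch (not the authors' code; every function name, tensor shape, and hyperparameter below is an assumption made for illustration), the following shows the general shape of (1) a temporally aware loss over adjacent frame latents, in the spirit of the FTC Blocks, (2) an edge-preserving bilateral filter of the kind the SCD Blocks employ, and (3) the concatenation of shared and frame-specific prompt tokens that the TSC Module describes:

    # Minimal sketch; all names and shapes are hypothetical, not the authors' code.
    import torch
    import torch.nn.functional as F

    def temporal_consistency_loss(latents: torch.Tensor) -> torch.Tensor:
        # latents: (num_frames, C, H, W); penalize adjacent-frame differences.
        return F.mse_loss(latents[1:], latents[:-1])

    def bilateral_filter(x, kernel_size=5, sigma_space=2.0, sigma_color=0.1):
        # x: (C, H, W). Edge-preserving smoothing: a spatial Gaussian over
        # neighborhood offsets times a range Gaussian over intensity gaps.
        pad = kernel_size // 2
        coords = torch.arange(kernel_size, dtype=x.dtype, device=x.device) - pad
        yy, xx = torch.meshgrid(coords, coords, indexing="ij")
        spatial = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma_space ** 2))  # (k, k)
        x_pad = F.pad(x.unsqueeze(0), (pad,) * 4, mode="reflect").squeeze(0)
        # Neighborhood patches: (C, H, W, k, k).
        patches = x_pad.unfold(1, kernel_size, 1).unfold(2, kernel_size, 1)
        center = x.unsqueeze(-1).unsqueeze(-1)
        color = torch.exp(-((patches - center) ** 2) / (2 * sigma_color ** 2))
        weights = spatial * color  # broadcast (k, k) over (C, H, W, k, k)
        return (weights * patches).sum(dim=(-2, -1)) / weights.sum(dim=(-2, -1))

    def combine_prompt_tokens(shared, frame_specific):
        # shared: (S, D) tokens reused by all frames; frame_specific: (F, T, D).
        reps = shared.unsqueeze(0).expand(frame_specific.shape[0], -1, -1)
        return torch.cat([reps, frame_specific], dim=1)  # (F, S + T, D)

    # Usage on dummy data:
    frames = torch.randn(8, 4, 64, 64)  # e.g. 8 frames of 4-channel latents
    loss = temporal_consistency_loss(frames)
    smoothed = torch.stack([bilateral_filter(f) for f in frames])

In this sketch the bilateral weights smooth noise while preserving edges; per the abstract, the GE-Adapter integrates such filtering into DDIM inversion itself rather than applying it as standalone post-processing.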

Cite this Paper


BibTeX
@InProceedings{pmlr-v328-he26a, title = {Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion}, author = {He, Yangfan and Li, Sida and Wang, Jianhui and Song, Xinyuan and Li, Kun and Yuan, Xinhang and Lu, Kuan and Huo, Menghao and Tang, Jingqun and Xin, Yi and Chen, Jiaqi and Li, Keqin and Zhang, Miao and Wang, Xueqian}, booktitle = {Conference on Parsimony and Learning}, pages = {131--163}, year = {2026}, editor = {Burkholz, Rebekka and Liu, Shiwei and Ravishankar, Saiprasad and Redman, William and Huang, Wei and Su, Weijie and Zhu, Zhihui}, volume = {328}, series = {Proceedings of Machine Learning Research}, month = {23--26 Mar}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v328/main/assets/he26a/he26a.pdf}, url = {https://proceedings.mlr.press/v328/he26a.html}, abstract = {Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal, spatial, and semantic consistency with bilateral Denoising Diffusion Implicit Model (DDIM) inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions using temporally aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; (3) a Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment through a combination of shared prompt tokens and frame-specific tokens. Extensive experiments on multiple datasets demonstrate that our method significantly improves perceptual quality, text-image relevance, and temporal coherence. The proposed approach offers a practical and efficient solution for text-to-video (T2V) editing. Our code is available in the supplementary materials.} }
Endnote
%0 Conference Paper %T Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion %A Yangfan He %A Sida Li %A Jianhui Wang %A Xinyuan Song %A Kun Li %A Xinhang Yuan %A Kuan Lu %A Menghao Huo %A Jingqun Tang %A Yi Xin %A Jiaqi Chen %A Keqin Li %A Miao Zhang %A Xueqian Wang %B Conference on Parsimony and Learning %C Proceedings of Machine Learning Research %D 2026 %E Rebekka Burkholz %E Shiwei Liu %E Saiprasad Ravishankar %E William Redman %E Wei Huang %E Weijie Su %E Zhihui Zhu %F pmlr-v328-he26a %I PMLR %P 131--163 %U https://proceedings.mlr.press/v328/he26a.html %V 328 %X Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal, spatial, and semantic consistency with bilateral Denoising Diffusion Implicit Model (DDIM) inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions using temporally aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; (3) a Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment through a combination of shared prompt tokens and frame-specific tokens. Extensive experiments on multiple datasets demonstrate that our method significantly improves perceptual quality, text-image relevance, and temporal coherence. The proposed approach offers a practical and efficient solution for text-to-video (T2V) editing. Our code is available in the supplementary materials.
APA
He, Y., Li, S., Wang, J., Song, X., Li, K., Yuan, X., Lu, K., Huo, M., Tang, J., Xin, Y., Chen, J., Li, K., Zhang, M. & Wang, X. (2026). Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 328:131-163. Available from https://proceedings.mlr.press/v328/he26a.html.