Enhancing Video Representation Learning with Temporal Differentiation

Siyi Chen, Minkyu Choi, Zesen Zhao, Kuan Han, Qing Qu, Zhongming Liu
Conference on Parsimony and Learning, PMLR 280:1007-1034, 2025.

Abstract

Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: **Vi**deo Time-**Di**fferentiation for Instance **Di**scrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. The performances are enhanced by a significant margin without the need for large models or extensive datasets.
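
To make the core idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' released code) of how temporal differentiation and an alternating instance-discrimination objective could be wired together. The Taylor relation x(t+Δt) ≈ x(t) + Δt·x'(t) + (Δt²/2)·x''(t) motivates pairing original frames with first- and second-order frame differences. The `encoder`, `loss_fn`, tensor layout, and the cycling schedule are all assumptions made for illustration.

```python
# Minimal, hypothetical sketch of ViDiDi-style temporal differentiation.
# Assumes video clips shaped (batch, channels, time, height, width), a
# user-supplied `encoder`, and an instance-discrimination loss such as the
# SimCLR / BYOL / VICReg objectives mentioned in the abstract.
import torch


def temporal_derivative(clip: torch.Tensor, order: int) -> torch.Tensor:
    """Approximate the n-th temporal derivative by repeated frame differencing.

    Each first-order difference x[t+1] - x[t] is a discrete (finite-difference)
    estimate of d/dt and shortens the clip by one frame per application.
    """
    for _ in range(order):
        clip = clip[:, :, 1:] - clip[:, :, :-1]
    return clip


def vididi_step(encoder, loss_fn, view_a, view_b, step: int, max_order: int = 2):
    """One training step that alternates derivative orders across steps.

    A simple balanced schedule: cycle through (order_a, order_b) pairs so the
    shared encoder sees original frames and their derivatives equally often and
    is pushed to produce consistent embeddings for all of them.
    """
    pairs = [(i, j) for i in range(max_order + 1) for j in range(max_order + 1)]
    order_a, order_b = pairs[step % len(pairs)]
    z_a = encoder(temporal_derivative(view_a, order_a))
    z_b = encoder(temporal_derivative(view_b, order_b))
    return loss_fn(z_a, z_b)  # e.g., a VICReg / SimCLR / BYOL loss
```

The paper's balanced alternating learning algorithm and loss details go beyond this sketch; it only shows how differenced clips and original clips can share a single encoder within an existing instance-discrimination framework.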

Cite this Paper


BibTeX
@InProceedings{pmlr-v280-chen25c,
  title     = {Enhancing Video Representation Learning with Temporal Differentiation},
  author    = {Chen, Siyi and Choi, Minkyu and Zhao, Zesen and Han, Kuan and Qu, Qing and Liu, Zhongming},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {1007--1034},
  year      = {2025},
  editor    = {Chen, Beidi and Liu, Shijia and Pilanci, Mert and Su, Weijie and Sulam, Jeremias and Wang, Yuxiang and Zhu, Zhihui},
  volume    = {280},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v280/main/assets/chen25c/chen25c.pdf},
  url       = {https://proceedings.mlr.press/v280/chen25c.html},
  abstract  = {Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: **Vi**deo Time-**Di**fferentiation for Instance **Di**scrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. The performances are enhanced by a significant margin without the need for large models or extensive datasets.}
}
Endnote
%0 Conference Paper
%T Enhancing Video Representation Learning with Temporal Differentiation
%A Siyi Chen
%A Minkyu Choi
%A Zesen Zhao
%A Kuan Han
%A Qing Qu
%A Zhongming Liu
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Beidi Chen
%E Shijia Liu
%E Mert Pilanci
%E Weijie Su
%E Jeremias Sulam
%E Yuxiang Wang
%E Zhihui Zhu
%F pmlr-v280-chen25c
%I PMLR
%P 1007--1034
%U https://proceedings.mlr.press/v280/chen25c.html
%V 280
%X Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: **Vi**deo Time-**Di**fferentiation for Instance **Di**scrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. The performances are enhanced by a significant margin without the need for large models or extensive datasets.
APA
Chen, S., Choi, M., Zhao, Z., Han, K., Qu, Q. & Liu, Z. (2025). Enhancing Video Representation Learning with Temporal Differentiation. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 280:1007-1034. Available from https://proceedings.mlr.press/v280/chen25c.html.
