BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition

Fei Long, Xiaoou Li, Jiaming Lv, Haoyuan Yang, Xianjun Cheng, Peihua Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:40253-40269, 2025.

Abstract

Bridging contrastive language-image pre-training (CLIP) to video action recognition has attracted growing interest. Human actions are inherently rich in spatial and temporal contexts, involving dynamic interactions among people, objects, and the environment. Accurately recognizing actions requires effectively capturing these fine-grained elements and modeling their relationships with language. However, most existing methods rely on cosine similarity (practically equivalent to the Pearson correlation coefficient) between global tokens for video-language alignment. As a result, they have limited capacity to model complex dependencies and tend to overlook local tokens that encode critical spatio-temporal cues. To overcome these limitations, we propose BDC-CLIP, a novel framework that leverages Brownian Distance Covariance (BDC) to align visual and textual representations. Our method can capture complex relationships, both linear and nonlinear, between all visual and textual tokens, enabling fine-grained modeling in space, time, and language. BDC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully supervised action recognition settings, demonstrating its effectiveness and broad applicability.
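
For readers unfamiliar with the statistic, below is a minimal NumPy sketch of the classical empirical (squared) Brownian distance covariance of Székely and Rizzo, computed between two paired sets of token embeddings. It illustrates only the underlying measure, not the BDC-CLIP alignment module described in the paper, and the token counts and dimensions are hypothetical.

    import numpy as np

    def double_center(D):
        # Double-center a pairwise distance matrix:
        # A_jk = d_jk - row_mean_j - col_mean_k + grand_mean
        row_mean = D.mean(axis=1, keepdims=True)
        col_mean = D.mean(axis=0, keepdims=True)
        return D - row_mean - col_mean + D.mean()

    def bdc(X, Y):
        """Empirical squared Brownian distance covariance between samples
        X (n x p) and Y (n x q), paired along the first axis."""
        assert X.shape[0] == Y.shape[0], "X and Y must be paired sample-wise"
        # Pairwise Euclidean distance matrices within each set
        a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
        A, B = double_center(a), double_center(b)
        return (A * B).mean()  # = (1/n^2) * sum_jk A_jk * B_jk

    # Toy usage with hypothetical shapes: rows play the role of paired
    # visual/textual token embeddings.
    rng = np.random.default_rng(0)
    vis_tokens = rng.standard_normal((16, 512))
    txt_tokens = rng.standard_normal((16, 512))
    print(bdc(vis_tokens, txt_tokens))

At the population level, distance covariance vanishes only when the two variables are independent, which is why, unlike cosine similarity between two global vectors, it can register nonlinear as well as linear dependencies.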

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-long25a,
  title     = {{BDC}-{CLIP}: Brownian Distance Covariance for Adapting {CLIP} to Action Recognition},
  author    = {Long, Fei and Li, Xiaoou and Lv, Jiaming and Yang, Haoyuan and Cheng, Xianjun and Li, Peihua},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {40253--40269},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/long25a/long25a.pdf},
  url       = {https://proceedings.mlr.press/v267/long25a.html},
  abstract  = {Bridging contrastive language-image pre-training (CLIP) to video action recognition has attracted growing interest. Human actions are inherently rich in spatial and temporal contexts, involving dynamic interactions among people, objects, and the environment. Accurately recognizing actions requires effectively capturing these fine-grained elements and modeling their relationships with language. However, most existing methods rely on cosine similarity (practically equivalent to the Pearson correlation coefficient) between global tokens for video-language alignment. As a result, they have limited capacity to model complex dependencies and tend to overlook local tokens that encode critical spatio-temporal cues. To overcome these limitations, we propose BDC-CLIP, a novel framework that leverages Brownian Distance Covariance (BDC) to align visual and textual representations. Our method can capture complex relationships, both linear and nonlinear, between all visual and textual tokens, enabling fine-grained modeling in space, time, and language. BDC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully supervised action recognition settings, demonstrating its effectiveness and broad applicability.}
}
Endnote
%0 Conference Paper
%T BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition
%A Fei Long
%A Xiaoou Li
%A Jiaming Lv
%A Haoyuan Yang
%A Xianjun Cheng
%A Peihua Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-long25a
%I PMLR
%P 40253--40269
%U https://proceedings.mlr.press/v267/long25a.html
%V 267
%X Bridging contrastive language-image pre-training (CLIP) to video action recognition has attracted growing interest. Human actions are inherently rich in spatial and temporal contexts, involving dynamic interactions among people, objects, and the environment. Accurately recognizing actions requires effectively capturing these fine-grained elements and modeling their relationships with language. However, most existing methods rely on cosine similarity (practically equivalent to the Pearson correlation coefficient) between global tokens for video-language alignment. As a result, they have limited capacity to model complex dependencies and tend to overlook local tokens that encode critical spatio-temporal cues. To overcome these limitations, we propose BDC-CLIP, a novel framework that leverages Brownian Distance Covariance (BDC) to align visual and textual representations. Our method can capture complex relationships, both linear and nonlinear, between all visual and textual tokens, enabling fine-grained modeling in space, time, and language. BDC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully supervised action recognition settings, demonstrating its effectiveness and broad applicability.
APA
Long, F., Li, X., Lv, J., Yang, H., Cheng, X. & Li, P. (2025). BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:40253-40269. Available from https://proceedings.mlr.press/v267/long25a.html.
