ConTrans: Learning Text-enhanced Local–global Temporal Representations for Zero-shot Temporal Action Localization

Kanchan Keisham, Akilan Thangarajah, Pathmanathan Thenukan
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:354-365, 2026.

Abstract

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-keisham26a,
  title = {ConTrans: Learning Text-enhanced Local–global Temporal Representations for Zero-shot Temporal Action Localization},
  author = {Keisham, Kanchan and Thangarajah, Akilan and Thenukan, Pathmanathan},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {354--365},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/keisham26a/keisham26a.pdf},
  url = {https://proceedings.mlr.press/v318/keisham26a.html},
  abstract = {Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.}
}
Endnote
%0 Conference Paper
%T ConTrans: Learning Text-enhanced Local–global Temporal Representations for Zero-shot Temporal Action Localization
%A Kanchan Keisham
%A Akilan Thangarajah
%A Pathmanathan Thenukan
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-keisham26a
%I PMLR
%P 354--365
%U https://proceedings.mlr.press/v318/keisham26a.html
%V 318
%X Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.
APA
Keisham, K., Thangarajah, A., & Thenukan, P. (2026). ConTrans: Learning Text-enhanced Local–global Temporal Representations for Zero-shot Temporal Action Localization. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:354-365. Available from https://proceedings.mlr.press/v318/keisham26a.html.
