Temporal RPN Learning for Weakly-Supervised Temporal Action Localization

Jing Huang, Ming Kong, Luyuan Chen, Tian Liang, Qiang Zhu
Proceedings of the 15th Asian Conference on Machine Learning, PMLR 222:470-485, 2024.

Abstract

Weakly-Supervised Temporal Action Localization (WSTAL) aims to train an action instance localization model from untrimmed videos with only video-level labels, similar to the Object Detection (OD) task. Existing Top-k MIL-based WSTAL methods cannot flexibly define the learning space, which limits the model’s learning efficiency and performance. Faster R-CNN is a classic two-stage object detection architecture with an efficient Region Proposal Network. This paper migrates the Faster R-CNN-like two-stage architecture to the WSTAL task: first, it builds a T-RPN and integrates it with the traditional WSTAL framework; second, it proposes a pseudo-label generation mechanism that enables T-RPN learning without temporal annotations. The new framework achieves breakthrough performance on the THUMOS-14 and ActivityNet-v1.2 datasets, and comprehensive ablation experiments verify the effectiveness of the innovations. Code will be available at: https://github.com/ZJUHJ/TRPN.

Cite this Paper


BibTeX
@InProceedings{pmlr-v222-huang24a,
  title = {Temporal RPN Learning for Weakly-Supervised Temporal Action Localization},
  author = {Huang, Jing and Kong, Ming and Chen, Luyuan and Liang, Tian and Zhu, Qiang},
  booktitle = {Proceedings of the 15th Asian Conference on Machine Learning},
  pages = {470--485},
  year = {2024},
  editor = {Yanıkoğlu, Berrin and Buntine, Wray},
  volume = {222},
  series = {Proceedings of Machine Learning Research},
  month = {11--14 Nov},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v222/huang24a/huang24a.pdf},
  url = {https://proceedings.mlr.press/v222/huang24a.html},
  abstract = {Weakly-Supervised Temporal Action Localization (WSTAL) aims to train an action instance localization model from untrimmed videos with only video-level labels, similar to the Object Detection (OD) task. Existing Top-k MIL-based WSTAL methods cannot flexibly define the learning space, which limits the model’s learning efficiency and performance. Faster R-CNN is a classic two-stage object detection architecture with an efficient Region Proposal Network. This paper successfully migrates the Faster R-CNN liked two-stage architecture to the WSTAL task: first to build a T-RPN and integrate it with the traditional WSTAL framework; and then to propose a pseudo label generation mechanism to enable the T-RPN learning without temporal annotations. Our new framework has achieved breakthrough performances on THUMOS-14 and ActivityNet-v1.2 datasets, and comprehensive ablation experiments have verified the effectiveness of the innovations. Code will be available at: \href{https://github.com/ZJUHJ/TRPN}{https://github.com/ZJUHJ/TRPN}.}
}
Endnote
%0 Conference Paper
%T Temporal RPN Learning for Weakly-Supervised Temporal Action Localization
%A Jing Huang
%A Ming Kong
%A Luyuan Chen
%A Tian Liang
%A Qiang Zhu
%B Proceedings of the 15th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Berrin Yanıkoğlu
%E Wray Buntine
%F pmlr-v222-huang24a
%I PMLR
%P 470--485
%U https://proceedings.mlr.press/v222/huang24a.html
%V 222
%X Weakly-Supervised Temporal Action Localization (WSTAL) aims to train an action instance localization model from untrimmed videos with only video-level labels, similar to the Object Detection (OD) task. Existing Top-k MIL-based WSTAL methods cannot flexibly define the learning space, which limits the model’s learning efficiency and performance. Faster R-CNN is a classic two-stage object detection architecture with an efficient Region Proposal Network. This paper successfully migrates the Faster R-CNN liked two-stage architecture to the WSTAL task: first to build a T-RPN and integrate it with the traditional WSTAL framework; and then to propose a pseudo label generation mechanism to enable the T-RPN learning without temporal annotations. Our new framework has achieved breakthrough performances on THUMOS-14 and ActivityNet-v1.2 datasets, and comprehensive ablation experiments have verified the effectiveness of the innovations. Code will be available at: https://github.com/ZJUHJ/TRPN.
APA
Huang, J., Kong, M., Chen, L., Liang, T., & Zhu, Q. (2024). Temporal RPN Learning for Weakly-Supervised Temporal Action Localization. Proceedings of the 15th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 222:470-485. Available from https://proceedings.mlr.press/v222/huang24a.html.
