Spatial Temporal Enhanced Contrastive and Pretext Learning for Skeleton-based Action Representation

Yiwen Zhan, Yuchen Chen, Pengfei Ren, Haifeng Sun, Jingyu Wang, Qi Qi, Jianxin Liao
Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:534-547, 2021.

Abstract

In this paper, we focus on unsupervised representation learning for skeleton-based action recognition. The critical issue in this task is extracting discriminative spatial-temporal information from skeleton sequences to form action representations. To address it, we propose a novel unsupervised framework named the contrastive-pretext spatial-temporal network (CP-STN), which aims at accurate action recognition by better exploiting discriminative, spatial-temporally enhanced features from massive unlabeled data. We combine the contrastive and pretext-task learning paradigms in one framework, using asymmetric spatial and temporal augmentations so that the network can fully extract discriminative representations carrying spatial-temporal information. Furthermore, graph-based convolution is used as the backbone to exploit the natural spatial-temporal graph structure of skeleton data. Extensive experimental results show that CP-STN significantly boosts the performance of existing skeleton-based action representation learning networks and achieves state-of-the-art accuracy on two challenging benchmarks in both unsupervised and semi-supervised settings.
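
The abstract does not spell out the training objective, but its description (a contrastive branch plus a pretext task, fed by asymmetric spatial and temporal augmentations of the same skeleton clip) maps onto a familiar two-branch self-supervised setup. The PyTorch sketch below is a minimal illustration under that reading only: the plain-convolution encoder (standing in for the graph-convolution backbone), the joint-masking and frame-reversal augmentations, the reversal-prediction pretext task, and the use of InfoNCE as the contrastive loss are all hypothetical placeholders, not the authors' exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # Stand-in for the graph-convolution backbone; a real implementation
    # would convolve over the skeleton's joint adjacency. A plain Conv2d
    # over (frames, joints) keeps the example self-contained.
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                 # x: (batch, channels, frames, joints)
        return self.fc(self.net(x).flatten(1))

def spatial_augment(x):
    # Hypothetical spatial view: randomly zero out ~10% of joints.
    mask = (torch.rand(x.size(0), 1, 1, x.size(3)) > 0.1).float()
    return x * mask

def temporal_augment(x):
    # Hypothetical temporal view: reverse the frame order half the time.
    # Returns the view and a binary pretext label (reversed or not).
    reverse = torch.rand(()) < 0.5
    return (x.flip(dims=[2]), 1) if reverse else (x, 0)

def info_nce(z1, z2, temperature=0.07):
    # Standard InfoNCE contrastive loss; positives share a batch index.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))     # positives on the diagonal
    return F.cross_entropy(logits, labels)

# One hypothetical training step on a fake batch:
# 8 clips, 3 coordinate channels, 50 frames, 25 joints.
encoder = Encoder()
pretext_head = nn.Linear(128, 2)          # classifies the temporal pretext label
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(pretext_head.parameters()), lr=1e-3)

x = torch.randn(8, 3, 50, 25)
v_spatial = spatial_augment(x)            # asymmetric views: one spatial,
v_temporal, label = temporal_augment(x)   # one temporal

z_s, z_t = encoder(v_spatial), encoder(v_temporal)
loss = info_nce(z_s, z_t) + F.cross_entropy(
    pretext_head(z_t), torch.full((x.size(0),), label))
opt.zero_grad(); loss.backward(); opt.step()

The point of the asymmetry is that the contrastive loss must match a spatially corrupted view to a temporally corrupted one, so the shared encoder cannot ignore either axis, while the pretext head adds direct supervision on the temporal structure.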

Cite this Paper


BibTeX
@InProceedings{pmlr-v157-zhan21a,
  title     = {Spatial Temporal Enhanced Contrastive and Pretext Learning for Skeleton-based Action Representation},
  author    = {Zhan, Yiwen and Chen, Yuchen and Ren, Pengfei and Sun, Haifeng and Wang, Jingyu and Qi, Qi and Liao, Jianxin},
  booktitle = {Proceedings of The 13th Asian Conference on Machine Learning},
  pages     = {534--547},
  year      = {2021},
  editor    = {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume    = {157},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--19 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v157/zhan21a/zhan21a.pdf},
  url       = {https://proceedings.mlr.press/v157/zhan21a.html}
}
Endnote
%0 Conference Paper
%T Spatial Temporal Enhanced Contrastive and Pretext Learning for Skeleton-based Action Representation
%A Yiwen Zhan
%A Yuchen Chen
%A Pengfei Ren
%A Haifeng Sun
%A Jingyu Wang
%A Qi Qi
%A Jianxin Liao
%B Proceedings of The 13th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Vineeth N. Balasubramanian
%E Ivor Tsang
%F pmlr-v157-zhan21a
%I PMLR
%P 534--547
%U https://proceedings.mlr.press/v157/zhan21a.html
%V 157
APA
Zhan, Y., Chen, Y., Ren, P., Sun, H., Wang, J., Qi, Q. & Liao, J. (2021). Spatial Temporal Enhanced Contrastive and Pretext Learning for Skeleton-based Action Representation. Proceedings of The 13th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 157:534-547. Available from https://proceedings.mlr.press/v157/zhan21a.html.
