Local Shuffled Skeleton Position Embedding Vision Transformer for Human Activity Recognition

Zihui Yan, Xiyu Shi, Varuna De Silva
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:431-446, 2025.

Abstract

Vision Transformers (ViTs) in human activity recognition tasks suffer from inadequate spatial modeling through conventional position embeddings, leading to over-reliance on fixed positional information. This paper proposes Shuffled Positional Embedding (SPE), a mechanism that randomly disrupts the order of positional encoding during each forward propagation, reducing model dependence on position embedding and encouraging exploration of intrinsic spatial relationships. While SPE enhances general spatial awareness, it lacks targeted guidance for human-centric modeling. To address this limitation, Local Shuffled Skeleton Position Embedding (LSSPE) is developed, which leverages 2D skeleton data to provide human body structure-aware spatial representation. LSSPE computes attention weights based on spatial distances between image patches and skeleton keypoints, incorporating joint motion amplitudes for enhanced modeling. To further utilize skeleton data, a dual-stream architecture is designed combining TimeSFormer with LSSPE (LSSPE-TimeSFormer) for RGB processing and SkateFormer for skeleton processing. The proposed dual-stream model achieves outstanding performance of 95.8% and 98.7% accuracy on NTU RGB+D cross-subject and cross-view settings, establishing the effectiveness of skeleton-aware position embedding for human activity recognition.
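The core SPE mechanism described above, randomly permuting the positional encodings on each training forward pass so the model cannot lean on a fixed position-to-token mapping, can be sketched as follows. This is a minimal illustrative sketch based only on the abstract; the function name, the NumPy framing, and the inference-time behavior are assumptions, not the authors' implementation.

```python
import numpy as np

def shuffled_position_embedding(pos_embed, rng, training=True):
    """Illustrative SPE sketch: permute position embeddings along the
    token axis each forward pass during training, so downstream layers
    cannot rely on a fixed positional ordering."""
    if not training:
        return pos_embed  # assumed deterministic at inference
    n_tokens = pos_embed.shape[0]
    perm = rng.permutation(n_tokens)  # fresh permutation per call
    return pos_embed[perm]

rng = np.random.default_rng(0)
pos_embed = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 tokens, dim 3
shuffled = shuffled_position_embedding(pos_embed, rng)
```

Each call draws a fresh permutation, so the same embedding vectors are present but their token assignment changes between forward passes.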

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-yan25a,
  title     = {Local Shuffled Skeleton Position Embedding Vision Transformer for Human Activity Recognition},
  author    = {Yan, Zihui and Shi, Xiyu and De Silva, Varuna},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {431--446},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/yan25a/yan25a.pdf},
  url       = {https://proceedings.mlr.press/v304/yan25a.html},
  abstract  = {Vision Transformers (ViTs) in human activity recognition tasks suffer from inadequate spatial modeling through conventional position embeddings, leading to over-reliance on fixed positional information. This paper proposes Shuffled Positional Embedding (SPE), a mechanism that randomly disrupts the order of positional encoding during each forward propagation, reducing model dependence on position embedding and encouraging exploration of intrinsic spatial relationships. While SPE enhances general spatial awareness, it lacks targeted guidance for human-centric modeling. To address this limitation, Local Shuffled Skeleton Position Embedding (LSSPE) is developed, which leverages 2D skeleton data to provide human body structure-aware spatial representation. LSSPE computes attention weights based on spatial distances between image patches and skeleton keypoints, incorporating joint motion amplitudes for enhanced modeling. To further utilize skeleton data, a dual-stream architecture is designed combining TimeSFormer with LSSPE (LSSPE-TimeSFormer) for RGB processing and SkateFormer for skeleton processing. The proposed dual-stream model achieves outstanding performance of 95.8% and 98.7% accuracy on NTU RGB+D cross-subject and cross-view settings, establishing the effectiveness of skeleton-aware position embedding for human activity recognition.}
}
Endnote
%0 Conference Paper
%T Local Shuffled Skeleton Position Embedding Vision Transformer for Human Activity Recognition
%A Zihui Yan
%A Xiyu Shi
%A Varuna De Silva
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-yan25a
%I PMLR
%P 431--446
%U https://proceedings.mlr.press/v304/yan25a.html
%V 304
%X Vision Transformers (ViTs) in human activity recognition tasks suffer from inadequate spatial modeling through conventional position embeddings, leading to over-reliance on fixed positional information. This paper proposes Shuffled Positional Embedding (SPE), a mechanism that randomly disrupts the order of positional encoding during each forward propagation, reducing model dependence on position embedding and encouraging exploration of intrinsic spatial relationships. While SPE enhances general spatial awareness, it lacks targeted guidance for human-centric modeling. To address this limitation, Local Shuffled Skeleton Position Embedding (LSSPE) is developed, which leverages 2D skeleton data to provide human body structure-aware spatial representation. LSSPE computes attention weights based on spatial distances between image patches and skeleton keypoints, incorporating joint motion amplitudes for enhanced modeling. To further utilize skeleton data, a dual-stream architecture is designed combining TimeSFormer with LSSPE (LSSPE-TimeSFormer) for RGB processing and SkateFormer for skeleton processing. The proposed dual-stream model achieves outstanding performance of 95.8% and 98.7% accuracy on NTU RGB+D cross-subject and cross-view settings, establishing the effectiveness of skeleton-aware position embedding for human activity recognition.
APA
Yan, Z., Shi, X. & De Silva, V. (2025). Local Shuffled Skeleton Position Embedding Vision Transformer for Human Activity Recognition. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:431-446. Available from https://proceedings.mlr.press/v304/yan25a.html.
