FTP: A Human Pose Estimation Method Integrating Temporal and Fine-Grained Feature Fusion

Shuqiang Cai, Chennan Ma, Xin Wang, Li Lin, Ming Yan, Xincheng Lin, Shuqi Fan, Siqi Shen
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:858-872, 2025.

Abstract

Human pose estimation is a significant research direction in computer vision, with critical applications in human motion reconstruction and analysis. Most existing human pose estimation methods rely on single-modality sensor input, such as RGB images or LiDAR point clouds. While these methods have achieved promising results within their respective domains, they remain limited by the inherent deficiencies of each modality, which hinders their applicability across diverse real-world scenarios. With the recent introduction of numerous multi-modality human pose datasets, multi-modality approaches have begun to emerge. However, existing multi-modality fusion methods mainly consider the global feature relationships between modalities, without modeling finer-grained features or the dynamic temporal relationships between them. To address this issue, we propose a novel pipeline that integrates point cloud and image features, explicitly encoding fine-grained features and dynamic temporal relationships between the two modalities. Additionally, we employ a discriminator structure for semi-supervised training. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance compared to previous methods.
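To make the high-level description above concrete, the sketch below illustrates one plausible way such a pipeline could be organized: per-frame point-cloud tokens attend to image tokens (fine-grained fusion), the fused per-frame features are aggregated across frames with a temporal encoder, and a small discriminator scores predicted pose sequences for semi-supervised training. This is a minimal illustrative sketch, not the authors' implementation; all module names, feature dimensions, and the 24-joint 3D output are assumptions made for illustration.

# Illustrative sketch only (assumed architecture, not the paper's code).
import torch
import torch.nn as nn


class CrossModalTemporalFusion(nn.Module):
    def __init__(self, dim=256, num_joints=24, heads=4):
        super().__init__()
        # Fine-grained fusion: point-cloud tokens attend to image tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal fusion: fused per-frame features attend across frames.
        temporal_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=2)
        self.head = nn.Linear(dim, num_joints * 3)  # 3D joint regressor

    def forward(self, img_tokens, pc_tokens):
        # img_tokens: (B, T, N_img, dim), pc_tokens: (B, T, N_pc, dim)
        B, T, N_pc, D = pc_tokens.shape
        q = pc_tokens.reshape(B * T, N_pc, D)
        kv = img_tokens.reshape(B * T, -1, D)
        fused, _ = self.cross_attn(q, kv, kv)             # fine-grained fusion
        frame_feat = fused.mean(dim=1).reshape(B, T, D)   # per-frame feature
        frame_feat = self.temporal(frame_feat)            # temporal fusion
        return self.head(frame_feat).reshape(B, T, -1, 3)


class PoseDiscriminator(nn.Module):
    # Scores pose sequences (labeled/real vs. predicted) for the
    # semi-supervised, adversarial part of training.
    def __init__(self, num_joints=24, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, poses):              # poses: (B, T, num_joints, 3)
        B, T = poses.shape[:2]
        return self.net(poses.reshape(B * T, -1))


if __name__ == "__main__":
    model = CrossModalTemporalFusion()
    disc = PoseDiscriminator()
    img = torch.randn(2, 8, 49, 256)       # dummy image tokens
    pc = torch.randn(2, 8, 64, 256)        # dummy point-cloud tokens
    pred = model(img, pc)                  # (2, 8, 24, 3) predicted joints
    score = disc(pred)                     # per-frame realism scores
    print(pred.shape, score.shape)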

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-cai25a,
  title     = {{FTP}: {A} Human Pose Estimation Method Integrating Temporal and Fine-Grained Feature Fusion},
  author    = {Cai, Shuqiang and Ma, Chennan and Wang, Xin and Lin, Li and Yan, Ming and Lin, Xincheng and Fan, Shuqi and Shen, Siqi},
  booktitle = {Proceedings of the 16th Asian Conference on Machine Learning},
  pages     = {858--872},
  year      = {2025},
  editor    = {Nguyen, Vu and Lin, Hsuan-Tien},
  volume    = {260},
  series    = {Proceedings of Machine Learning Research},
  month     = {05--08 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/cai25a/cai25a.pdf},
  url       = {https://proceedings.mlr.press/v260/cai25a.html},
  abstract  = {Human pose estimation is a significant research direction in the field of computer vision, with critical applications in human motion reconstruction and analysis. Currently proposed human pose estimation methods primarily focus on single-modality sensor information, such as RGB images and LiDAR point clouds. While these methods have achieved promising results within their respective domains, they remain limited by the inherent deficiencies of each modality, hindering their applicability across diverse real-world scenarios. With the recent introduction of numerous multi-modality human pose datasets, multi-modality approaches have begun to develop. However, existing multi-modality fusion methods mainly consider the global feature relationships between different modalities, without modeling finer-grained features or the dynamic temporal relationships between modalities. To address this issue, we propose a novel pipeline that integrates point cloud and image features, explicitly encoding fine-grained features and dynamic temporal relationships between the two modalities. Additionally, we employ a discriminator structure for semi-supervised training. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance compared to previous methods.}
}
Endnote
%0 Conference Paper
%T FTP: A Human Pose Estimation Method Integrating Temporal and Fine-Grained Feature Fusion
%A Shuqiang Cai
%A Chennan Ma
%A Xin Wang
%A Li Lin
%A Ming Yan
%A Xincheng Lin
%A Shuqi Fan
%A Siqi Shen
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-cai25a
%I PMLR
%P 858--872
%U https://proceedings.mlr.press/v260/cai25a.html
%V 260
%X Human pose estimation is a significant research direction in the field of computer vision, with critical applications in human motion reconstruction and analysis. Currently proposed human pose estimation methods primarily focus on single-modality sensor information, such as RGB images and LiDAR point clouds. While these methods have achieved promising results within their respective domains, they remain limited by the inherent deficiencies of each modality, hindering their applicability across diverse real-world scenarios. With the recent introduction of numerous multi-modality human pose datasets, multi-modality approaches have begun to develop. However, existing multi-modality fusion methods mainly consider the global feature relationships between different modalities, without modeling finer-grained features or the dynamic temporal relationships between modalities. To address this issue, we propose a novel pipeline that integrates point cloud and image features, explicitly encoding fine-grained features and dynamic temporal relationships between the two modalities. Additionally, we employ a discriminator structure for semi-supervised training. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance compared to previous methods.
APA
Cai, S., Ma, C., Wang, X., Lin, L., Yan, M., Lin, X., Fan, S. & Shen, S. (2025). FTP: A Human Pose Estimation Method Integrating Temporal and Fine-Grained Feature Fusion. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:858-872. Available from https://proceedings.mlr.press/v260/cai25a.html.