FTP: A Human Pose Estimation Method Integrating Temporal and Fine-Grained Feature Fusion
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:858-872, 2025.
Abstract
Human pose estimation is a significant research direction in computer vision, with critical applications in human motion reconstruction and analysis. Existing human pose estimation methods primarily rely on single-modality sensor information, such as RGB images or LiDAR point clouds. While these methods have achieved promising results within their respective domains, they remain limited by the inherent deficiencies of each modality, which hinders their applicability across diverse real-world scenarios. With the recent introduction of numerous multi-modality human pose datasets, multi-modality approaches have begun to emerge. However, existing multi-modality fusion methods mainly consider the global feature relationships between modalities, without modeling finer-grained features or the dynamic temporal relationships between modalities. To address this issue, we propose a novel pipeline that integrates point cloud and image features, explicitly encoding fine-grained features and dynamic temporal relationships between the two modalities. Additionally, we employ a discriminator structure for semi-supervised training. Extensive experiments demonstrate that our method outperforms previous methods, achieving state-of-the-art (SOTA) performance.