ExtPose: Robust and Coherent Pose Estimation by Extending ViTs

Rongyu Chen, Li’An Zhuo, Linlin Yang, Qi Wang, Liefeng Bo, Bang Zhang, Angela Yao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:9933-9946, 2025.

Abstract

Vision Transformers (ViTs) are remarkable at 3D pose estimation, yet they still face certain challenges. One is that the popular ViT architecture for pose estimation operates on single images and lacks temporal information. Another is that predictions often fail to maintain pixel alignment with the original images. To address these issues, we propose a systematic framework for 3D pose estimation, called ExtPose. ExtPose extends an image ViT to challenging scenarios and video settings by taking in additional 2D pose evidence and capturing temporal information through full attention. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across modalities and frames, we enhance the consistency between 3D poses and 2D videos without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks, with substantial improvements over other ViT-based methods: PA-MPJPE drops to 34.0mm (-23%) on 3DPW and to 4.9mm (-18%) on FreiHAND.
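The recipe in the abstract — render 2D poses as skeleton images, patchify them alongside the RGB frames, and let one shared encoder attend over all tokens — can be sketched in a few lines. The PyTorch toy below is our own reading of that recipe, not the authors' released code; the module names, token embeddings, and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ExtendedViTSketch(nn.Module):
    """Toy illustration (not the paper's code): one shared ViT encoder
    attends jointly over patch tokens from RGB frames and from rendered
    2D-skeleton images across a short clip, i.e. full attention."""

    def __init__(self, img_size=256, patch=16, dim=256, depth=4, heads=8, max_frames=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n, dim))              # spatial position
        self.modality = nn.Parameter(torch.zeros(2, 1, 1, dim))      # 0 = RGB, 1 = skeleton
        self.frame = nn.Parameter(torch.zeros(max_frames, 1, dim))   # temporal index
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def tokenize(self, clip, modality_id):
        # clip: (B, T, 3, H, W) -> (B, T*N, dim) patch tokens
        B, T = clip.shape[:2]
        x = self.patch_embed(clip.flatten(0, 1))        # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2) + self.pos     # (B*T, N, dim)
        x = x.view(B, T, -1, x.shape[-1])               # (B, T, N, dim)
        x = x + self.frame[:T].unsqueeze(0) + self.modality[modality_id]
        return x.flatten(1, 2)                          # (B, T*N, dim)

    def forward(self, frames, skeletons):
        # A single token sequence; every token attends to every other,
        # so temporal and cross-modal mixing reuse the same encoder weights.
        tokens = torch.cat([self.tokenize(frames, 0),
                            self.tokenize(skeletons, 1)], dim=1)
        return self.encoder(tokens)


feats = ExtendedViTSketch()(torch.randn(2, 4, 3, 256, 256),   # RGB clip
                            torch.randn(2, 4, 3, 256, 256))   # skeleton renders
print(feats.shape)  # torch.Size([2, 2048, 256]): 2 modalities x 4 frames x 256 patches
```

Note how the sketch mirrors the abstract's parameter-sharing claim: the same patch embedding and encoder serve both modalities and all frames, so cross-modal and temporal fusion add only lightweight token embeddings rather than new network branches.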

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25cs,
  title     = {{E}xt{P}ose: Robust and Coherent Pose Estimation by Extending {V}i{T}s},
  author    = {Chen, Rongyu and Zhuo, Li'An and Yang, Linlin and Wang, Qi and Bo, Liefeng and Zhang, Bang and Yao, Angela},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {9933--9946},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25cs/chen25cs.pdf},
  url       = {https://proceedings.mlr.press/v267/chen25cs.html}
}
Endnote
%0 Conference Paper
%T ExtPose: Robust and Coherent Pose Estimation by Extending ViTs
%A Rongyu Chen
%A Li’An Zhuo
%A Linlin Yang
%A Qi Wang
%A Liefeng Bo
%A Bang Zhang
%A Angela Yao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25cs
%I PMLR
%P 9933--9946
%U https://proceedings.mlr.press/v267/chen25cs.html
%V 267
APA
Chen, R., Zhuo, L., Yang, L., Wang, Q., Bo, L., Zhang, B. & Yao, A. (2025). ExtPose: Robust and Coherent Pose Estimation by Extending ViTs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:9933-9946. Available from https://proceedings.mlr.press/v267/chen25cs.html.