Articulated Object Estimation in the Wild

Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, Abhinav Valada
Proceedings of The 9th Conference on Robot Learning, PMLR 305:3828-3849, 2025.

Abstract

Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states; these assumptions tend to fail in more realistic, unconstrained environments. In contrast, humans effortlessly infer articulation modes by watching others manipulate objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework capable of inferring articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset capturing articulated object interactions at a scene level, accompanied by articulation labels and ground-truth camera poses. We benchmark ArtiPoint against a range of classical and modern deep learning baselines, demonstrating its superior performance on Arti4D. We make our code and Arti4D publicly available at redacted-for-review.
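
The following is a minimal geometric sketch of the axis-recovery idea that the abstract alludes to, not ArtiPoint's actual pipeline (which couples deep point tracking with factor graph optimization). It assumes 3D tracks of points on a moving part are already expressed in a fixed world frame (camera motion compensated); the function names and toy data are illustrative only.

# Illustrative sketch (not the authors' method): fit the rigid motion of an
# articulated part between two frames from tracked 3D points (Kabsch/Procrustes),
# then read off a revolute axis direction from the estimated rotation.
import numpy as np

def rigid_transform(P, Q):
    """Fit R, t such that Q ~= P @ R.T + t for corresponding Nx3 point sets."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

def revolute_axis(R):
    """Unit rotation axis of R: the eigenvector with eigenvalue 1."""
    w, V = np.linalg.eig(R)
    axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return axis / np.linalg.norm(axis)

# Toy example: points on a door swinging 30 degrees about the z-axis.
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
P0 = np.random.default_rng(0).uniform(0.2, 1.0, size=(50, 3))  # tracked points, frame 0
P1 = P0 @ Rz.T                                                  # same points, frame T
R, t = rigid_transform(P0, P1)
print("estimated axis:", revolute_axis(R))  # ~ [0, 0, 1], up to sign

In the paper's setting, such per-part motion estimates are instead refined jointly over the whole sequence via factor graph optimization, which is what makes the estimation robust to noisy tracks and partial observability.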

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-werby25a,
  title     = {Articulated Object Estimation in the Wild},
  author    = {Werby, Abdelrhman and B\"{u}chner, Martin and R\"{o}fer, Adrian and Huang, Chenguang and Burgard, Wolfram and Valada, Abhinav},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {3828--3849},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/werby25a/werby25a.pdf},
  url       = {https://proceedings.mlr.press/v305/werby25a.html},
  abstract  = {Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic, unconstrained environments. In contrast, humans effortlessly infer articulation modes by watching others manipulating objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework capable of inferring articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset capturing articulated object interactions at a scene level, accompanied with articulation labels and ground truth camera poses. We benchmark ArtiPoint against a range of classical and modern deep learning baselines, demonstrating its superior performance on Arti4D. We make our code and Arti4D publicly available at redacted-for-review.}
}
Endnote
%0 Conference Paper
%T Articulated Object Estimation in the Wild
%A Abdelrhman Werby
%A Martin Büchner
%A Adrian Röfer
%A Chenguang Huang
%A Wolfram Burgard
%A Abhinav Valada
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-werby25a
%I PMLR
%P 3828--3849
%U https://proceedings.mlr.press/v305/werby25a.html
%V 305
%X Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic, unconstrained environments. In contrast, humans effortlessly infer articulation modes by watching others manipulating objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework capable of inferring articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset capturing articulated object interactions at a scene level, accompanied with articulation labels and ground truth camera poses. We benchmark ArtiPoint against a range of classical and modern deep learning baselines, demonstrating its superior performance on Arti4D. We make our code and Arti4D publicly available at redacted-for-review.
APA
Werby, A., Büchner, M., Röfer, A., Huang, C., Burgard, W., & Valada, A. (2025). Articulated Object Estimation in the Wild. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3828-3849. Available from https://proceedings.mlr.press/v305/werby25a.html.