Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States

Nicholas Carlotti, Mirko Nava, Alessandro Giusti
Proceedings of The 9th Conference on Robot Learning, PMLR 305:3304-3317, 2025.

Abstract

We introduce a model for monocular RGB relative pose estimation of a ground robot that trains from scratch without pose labels or prior knowledge about the robot’s shape or appearance. At training time, we assume: (i) a robot fitted with multiple LEDs, whose states are independent and known at each frame; (ii) knowledge of the approximate viewing direction of each LED; and (iii) availability of a calibration image with a known target distance, to address the ambiguity of monocular depth estimation. Training data is collected by a pair of robots moving randomly, without external infrastructure or human supervision. Our model trains on the task of predicting from an image the state of each LED on the robot. In doing so, it learns to predict the position of the robot in the image, its distance, and its relative bearing. At inference time, the state of the LEDs is unknown, can be arbitrary, and does not affect the pose estimation performance. Quantitative experiments indicate that our approach: is competitive with state-of-the-art approaches that require supervision from pose labels or a CAD model of the robot; generalizes to different domains; and handles multi-robot pose estimation.
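To make the pretext task concrete, the sketch below illustrates the core idea in PyTorch: a network supervised only with binary cross-entropy on the known LED states, whose spatial attention over image locations lies on the gradient path of that loss and therefore implicitly learns to localize the robot. Everything here is a hypothetical minimal sketch, not the authors' code: the names (LEDPoseNet, led_map, attn_map), the tiny backbone, and the choice of K = 4 LEDs are illustrative assumptions, and the paper's actual architecture, its distance and bearing estimates, and its use of LED viewing directions are not reproduced.

```python
# Minimal sketch (assumptions labeled): learn robot localization as a
# byproduct of classifying LED states. Only the LED loss is supervised.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 4  # assumed number of LEDs; the actual robot's count may differ

class LEDPoseNet(nn.Module):  # illustrative name, not from the paper
    def __init__(self, k_leds: int = K):
        super().__init__()
        # Tiny stand-in backbone; the paper's network is not reproduced here.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.led_map = nn.Conv2d(32, k_leds, 1)  # per-location LED-state logits
        self.attn_map = nn.Conv2d(32, 1, 1)      # per-location "robot here" score

    def forward(self, img):
        f = self.backbone(img)                          # (B, 32, H', W')
        w = f.shape[-1]
        attn = self.attn_map(f).flatten(2).softmax(-1)  # (B, 1, H'*W')
        led = self.led_map(f).flatten(2)                # (B, K, H'*W')
        led_logits = (led * attn).sum(-1)               # attention-pooled, (B, K)
        # The attention peak doubles as the predicted image position of the
        # robot; distance and bearing heads are omitted from this sketch.
        idx = attn.argmax(-1).squeeze(1)
        pos = torch.stack([idx % w, idx // w], dim=-1)  # (col, row) in feature map
        return led_logits, pos

model = LEDPoseNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One self-supervised step: LED states are known at training time, pose is not.
img = torch.rand(8, 3, 120, 160)                  # batch of camera frames
led_states = torch.randint(0, 2, (8, K)).float()  # known on/off state per LED

led_logits, pos = model(img)
loss = F.binary_cross_entropy_with_logits(led_logits, led_states)
opt.zero_grad()
loss.backward()
opt.step()
```

Because only the LED loss is backpropagated, the model can predict LED states well only by attending to the robot, so localization emerges without pose labels; in the paper, the calibration image with a known target distance then anchors the metric scale of the emergent distance estimate.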

Cite this Paper

BibTeX
@InProceedings{pmlr-v305-carlotti25a,
  title     = {Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States},
  author    = {Carlotti, Nicholas and Nava, Mirko and Giusti, Alessandro},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {3304--3317},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/carlotti25a/carlotti25a.pdf},
  url       = {https://proceedings.mlr.press/v305/carlotti25a.html},
  abstract  = {We introduce a model for monocular RGB relative pose estimation of a ground robot that trains from scratch without pose labels or prior knowledge about the robot’s shape or appearance. At training time, we assume: (i) a robot fitted with multiple LEDs, whose states are independent and known at each frame; (ii) knowledge of the approximate viewing direction of each LED; and (iii) availability of a calibration image with a known target distance, to address the ambiguity of monocular depth estimation. Training data is collected by a pair of robots moving randomly, without external infrastructure or human supervision. Our model trains on the task of predicting from an image the state of each LED on the robot. In doing so, it learns to predict the position of the robot in the image, its distance, and its relative bearing. At inference time, the state of the LEDs is unknown, can be arbitrary, and does not affect the pose estimation performance. Quantitative experiments indicate that our approach: is competitive with state-of-the-art approaches that require supervision from pose labels or a CAD model of the robot; generalizes to different domains; and handles multi-robot pose estimation.}
}
Endnote
%0 Conference Paper
%T Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States
%A Nicholas Carlotti
%A Mirko Nava
%A Alessandro Giusti
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-carlotti25a
%I PMLR
%P 3304--3317
%U https://proceedings.mlr.press/v305/carlotti25a.html
%V 305
%X We introduce a model for monocular RGB relative pose estimation of a ground robot that trains from scratch without pose labels or prior knowledge about the robot’s shape or appearance. At training time, we assume: (i) a robot fitted with multiple LEDs, whose states are independent and known at each frame; (ii) knowledge of the approximate viewing direction of each LED; and (iii) availability of a calibration image with a known target distance, to address the ambiguity of monocular depth estimation. Training data is collected by a pair of robots moving randomly, without external infrastructure or human supervision. Our model trains on the task of predicting from an image the state of each LED on the robot. In doing so, it learns to predict the position of the robot in the image, its distance, and its relative bearing. At inference time, the state of the LEDs is unknown, can be arbitrary, and does not affect the pose estimation performance. Quantitative experiments indicate that our approach: is competitive with state-of-the-art approaches that require supervision from pose labels or a CAD model of the robot; generalizes to different domains; and handles multi-robot pose estimation.
APA
Carlotti, N., Nava, M. & Giusti, A. (2025). Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:3304-3317. Available from https://proceedings.mlr.press/v305/carlotti25a.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v305/main/assets/carlotti25a/carlotti25a.pdf