Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation

Siddhant Haldar, Lerrel Pinto
Proceedings of The 9th Conference on Robot Learning, PMLR 305:1825-1846, 2025.

Abstract

Building robotic agents capable of operating across diverse environments and object types remains a significant challenge, often requiring extensive data collection. This is particularly restrictive in robotics, where each data point must be physically executed in the real world. Consequently, there is a critical need for alternative data sources for robotics and frameworks that enable learning from such data. In this work, we present Point Policy, a new method for learning robot policies exclusively from offline human demonstration videos without any teleoperation data. Point Policy leverages state-of-the-art vision models and policy architectures to translate human hand poses into robot poses while capturing object states through semantically meaningful key points. This approach yields a morphology-agnostic representation that facilitates effective policy learning. Through experiments on a diverse set of real-world tasks, we demonstrate that Point Policy significantly outperforms prior methods for policy learning from human videos, performing well not only within the training distribution but also generalizing to novel object instances and cluttered environments. Videos of the robot are best viewed at anon-point-policy.github.io.

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-haldar25a,
  title     = {Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation},
  author    = {Haldar, Siddhant and Pinto, Lerrel},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {1825--1846},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/haldar25a/haldar25a.pdf},
  url       = {https://proceedings.mlr.press/v305/haldar25a.html},
  abstract  = {Building robotic agents capable of operating across diverse environments and object types remains a significant challenge, often requiring extensive data collection. This is particularly restrictive in robotics, where each data point must be physically executed in the real world. Consequently, there is a critical need for alternative data sources for robotics and frameworks that enable learning from such data. In this work, we present Point Policy, a new method for learning robot policies exclusively from offline human demonstration videos without any teleoperation data. Point Policy leverages state-of-the-art vision models and policy architectures to translate human hand poses into robot poses while capturing object states through semantically meaningful key points. This approach yields a morphology-agnostic representation that facilitates effective policy learning. Through experiments on a diverse set of real-world tasks, we demonstrate that Point Policy significantly outperforms prior methods for policy learning from human videos, performing well not only within the training distribution but also generalizing to novel object instances and cluttered environments. Videos of the robot are best viewed at anon-point-policy.github.io.}
}
Endnote
%0 Conference Paper
%T Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation
%A Siddhant Haldar
%A Lerrel Pinto
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-haldar25a
%I PMLR
%P 1825--1846
%U https://proceedings.mlr.press/v305/haldar25a.html
%V 305
%X Building robotic agents capable of operating across diverse environments and object types remains a significant challenge, often requiring extensive data collection. This is particularly restrictive in robotics, where each data point must be physically executed in the real world. Consequently, there is a critical need for alternative data sources for robotics and frameworks that enable learning from such data. In this work, we present Point Policy, a new method for learning robot policies exclusively from offline human demonstration videos without any teleoperation data. Point Policy leverages state-of-the-art vision models and policy architectures to translate human hand poses into robot poses while capturing object states through semantically meaningful key points. This approach yields a morphology-agnostic representation that facilitates effective policy learning. Through experiments on a diverse set of real-world tasks, we demonstrate that Point Policy significantly outperforms prior methods for policy learning from human videos, performing well not only within the training distribution but also generalizing to novel object instances and cluttered environments. Videos of the robot are best viewed at anon-point-policy.github.io.
APA
Haldar, S., & Pinto, L. (2025). Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:1825-1846. Available from https://proceedings.mlr.press/v305/haldar25a.html.