Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use

Haonan Chen, Cheng Zhu, Shuijing Liu, Yunzhu Li, Katherine Rose Driggs-Campbell
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2695-2713, 2025.

Abstract

Tool use is essential for enabling robots to perform complex real-world tasks, but learning such skills requires extensive datasets. While teleoperation is widely used, it is slow, delay-sensitive, and poorly suited for dynamic tasks. In contrast, human videos provide a natural way for data collection without specialized hardware, though they pose challenges for robot learning due to viewpoint variations and embodiment gaps. To address these challenges, we propose a framework that transfers tool-use knowledge from humans to robots. To improve the policy’s robustness to viewpoint variations, we use two RGB cameras to reconstruct 3D scenes and apply Gaussian splatting for novel view synthesis. We reduce the embodiment gap using segmented observations and tool-centric, task-space actions to achieve embodiment-invariant visuomotor policy learning. Our method achieves a 71% improvement in task success and a 77% reduction in data collection time compared to diffusion policies trained on teleoperation with equivalent time budgets. Our method also reduces data collection time by 41% compared with the state-of-the-art data collection interface.
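A minimal illustration of the tool-centric, task-space action representation mentioned in the abstract (a sketch, not the authors' implementation): each action can be expressed as the relative SE(3) transform of the tool pose between consecutive frames, so the action never references the human hand or the robot embodiment. The 4x4 homogeneous-matrix convention, the NumPy-based code, and the toy poses below are assumptions made for illustration only.

    import numpy as np

    def relative_tool_action(T_prev, T_curr):
        # Tool-centric action: the SE(3) transform from the previous tool pose
        # to the current one, expressed in the previous tool frame.
        # T_prev, T_curr: 4x4 homogeneous poses of the tool in the world frame.
        return np.linalg.inv(T_prev) @ T_curr

    # Toy example: the tool translates 5 cm along its own x-axis between frames.
    T0 = np.eye(4)
    T1 = np.eye(4)
    T1[0, 3] = 0.05
    action = relative_tool_action(T0, T1)
    print(action[:3, 3])  # [0.05 0.   0.  ]

Because the action is defined relative to the tool itself, the same representation applies whether the tool is held by a human demonstrator or mounted on the robot, which is the embodiment-invariance property the paper relies on.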

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-chen25d,
  title = {Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use},
  author = {Chen, Haonan and Zhu, Cheng and Liu, Shuijing and Li, Yunzhu and Driggs-Campbell, Katherine Rose},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages = {2695--2713},
  year = {2025},
  editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume = {305},
  series = {Proceedings of Machine Learning Research},
  month = {27--30 Sep},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/chen25d/chen25d.pdf},
  url = {https://proceedings.mlr.press/v305/chen25d.html},
  abstract = {Tool use is essential for enabling robots to perform complex real-world tasks, but learning such skills requires extensive datasets. While teleoperation is widely used, it is slow, delay-sensitive, and poorly suited for dynamic tasks. In contrast, human videos provide a natural way for data collection without specialized hardware, though they pose challenges for robot learning due to viewpoint variations and embodiment gaps. To address these challenges, we propose a framework that transfers tool-use knowledge from humans to robots. To improve the policy’s robustness to viewpoint variations, we use two RGB cameras to reconstruct 3D scenes and apply Gaussian splatting for novel view synthesis. We reduce the embodiment gap using segmented observations and tool-centric, task-space actions to achieve embodiment-invariant visuomotor policy learning. Our method achieves a 71% improvement in task success and a 77% reduction in data collection time compared to diffusion policies trained on teleoperation with equivalent time budgets. Our method also reduces data collection time by 41% compared with the state-of-the-art data collection interface.}
}
Endnote
%0 Conference Paper
%T Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use
%A Haonan Chen
%A Cheng Zhu
%A Shuijing Liu
%A Yunzhu Li
%A Katherine Rose Driggs-Campbell
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-chen25d
%I PMLR
%P 2695--2713
%U https://proceedings.mlr.press/v305/chen25d.html
%V 305
%X Tool use is essential for enabling robots to perform complex real-world tasks, but learning such skills requires extensive datasets. While teleoperation is widely used, it is slow, delay-sensitive, and poorly suited for dynamic tasks. In contrast, human videos provide a natural way for data collection without specialized hardware, though they pose challenges for robot learning due to viewpoint variations and embodiment gaps. To address these challenges, we propose a framework that transfers tool-use knowledge from humans to robots. To improve the policy’s robustness to viewpoint variations, we use two RGB cameras to reconstruct 3D scenes and apply Gaussian splatting for novel view synthesis. We reduce the embodiment gap using segmented observations and tool-centric, task-space actions to achieve embodiment-invariant visuomotor policy learning. Our method achieves a 71% improvement in task success and a 77% reduction in data collection time compared to diffusion policies trained on teleoperation with equivalent time budgets. Our method also reduces data collection time by 41% compared with the state-of-the-art data collection interface.
APA
Chen, H., Zhu, C., Liu, S., Li, Y., & Driggs-Campbell, K. R. (2025). Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2695-2713. Available from https://proceedings.mlr.press/v305/chen25d.html.