Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping

Qianxu Wang, Congyue Deng, Tyler Ga Wei Lum, Yuanpei Chen, Yaodong Yang, Jeannette Bohg, Yixin Zhu, Leonidas Guibas
Proceedings of The 8th Conference on Robot Learning, PMLR 270:4495-4508, 2025.

Abstract

One-shot transfer of dexterous grasps to novel scenes with object and context variations remains a challenging problem. While distilled feature fields from large vision models have enabled semantic correspondences across 3D scenes, their features are point-based and restricted to object surfaces, limiting their ability to model the complex semantic feature distributions that arise in hand-object interactions. In this work, we propose the *neural attention field*, which represents semantics-aware dense feature fields in 3D space by modeling inter-point relevance instead of individual point features. At its core is a transformer decoder that computes cross-attention between any 3D query point and all scene points, producing the query point's feature through attention-based aggregation. We further propose a self-supervised framework for training the transformer decoder from only a few 3D point clouds, without hand demonstrations. After training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from a one-shot demonstration. Experiments show that our method provides better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, yielding significant improvements in success rates on real robots compared with feature-field-based methods.
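
To make the aggregation concrete, below is a minimal PyTorch sketch of the cross-attention mechanism the abstract describes: a 3D query point is embedded, attends to per-point scene features (e.g., distilled from a large vision model), and receives its field value as the attention-weighted aggregation. The module name `AttentionFieldDecoder`, the dimensions, and the single-layer design are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the attention field's query mechanism, under assumed
# names and dimensions: one cross-attention layer where 3D query points
# attend to all scene point features and receive aggregated features.
import torch
import torch.nn as nn


class AttentionFieldDecoder(nn.Module):
    def __init__(self, feat_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Lift raw 3D coordinates of each query point into the feature space.
        self.query_embed = nn.Linear(3, feat_dim)
        # Cross-attention: queries from 3D points, keys/values from the scene.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, query_xyz: torch.Tensor, scene_feats: torch.Tensor) -> torch.Tensor:
        """query_xyz: (B, Q, 3) query point coordinates.
        scene_feats: (B, N, feat_dim) per-point scene features.
        Returns (B, Q, feat_dim) attention-aggregated query features."""
        q = self.query_embed(query_xyz)
        out, _ = self.cross_attn(q, scene_feats, scene_feats)
        return out


# Usage: the field can be evaluated at arbitrary 3D locations,
# on or off object surfaces.
decoder = AttentionFieldDecoder()
query_xyz = torch.rand(1, 64, 3)        # 64 query points in a unit cube
scene_feats = torch.rand(1, 2048, 256)  # 2048 scene points, 256-d features
field_values = decoder(query_xyz, scene_feats)  # (1, 64, 256)
```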

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-wang25l,
  title     = {Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping},
  author    = {Wang, Qianxu and Deng, Congyue and Lum, Tyler Ga Wei and Chen, Yuanpei and Yang, Yaodong and Bohg, Jeannette and Zhu, Yixin and Guibas, Leonidas},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {4495--4508},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/wang25l/wang25l.pdf},
  url       = {https://proceedings.mlr.press/v270/wang25l.html},
  abstract  = {One-shot transfer of dexterous grasps to novel scenes with object and context variations remains a challenging problem. While distilled feature fields from large vision models have enabled semantic correspondences across 3D scenes, their features are point-based and restricted to object surfaces, limiting their ability to model the complex semantic feature distributions that arise in hand-object interactions. In this work, we propose the *neural attention field*, which represents semantics-aware dense feature fields in 3D space by modeling inter-point relevance instead of individual point features. At its core is a transformer decoder that computes cross-attention between any 3D query point and all scene points, producing the query point's feature through attention-based aggregation. We further propose a self-supervised framework for training the transformer decoder from only a few 3D point clouds, without hand demonstrations. After training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from a one-shot demonstration. Experiments show that our method provides better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, yielding significant improvements in success rates on real robots compared with feature-field-based methods.}
}
Endnote
%0 Conference Paper
%T Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping
%A Qianxu Wang
%A Congyue Deng
%A Tyler Ga Wei Lum
%A Yuanpei Chen
%A Yaodong Yang
%A Jeannette Bohg
%A Yixin Zhu
%A Leonidas Guibas
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-wang25l
%I PMLR
%P 4495--4508
%U https://proceedings.mlr.press/v270/wang25l.html
%V 270
%X One-shot transfer of dexterous grasps to novel scenes with object and context variations remains a challenging problem. While distilled feature fields from large vision models have enabled semantic correspondences across 3D scenes, their features are point-based and restricted to object surfaces, limiting their ability to model the complex semantic feature distributions that arise in hand-object interactions. In this work, we propose the *neural attention field*, which represents semantics-aware dense feature fields in 3D space by modeling inter-point relevance instead of individual point features. At its core is a transformer decoder that computes cross-attention between any 3D query point and all scene points, producing the query point's feature through attention-based aggregation. We further propose a self-supervised framework for training the transformer decoder from only a few 3D point clouds, without hand demonstrations. After training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from a one-shot demonstration. Experiments show that our method provides better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, yielding significant improvements in success rates on real robots compared with feature-field-based methods.
APA
Wang, Q., Deng, C., Lum, T.G.W., Chen, Y., Yang, Y., Bohg, J., Zhu, Y. & Guibas, L. (2025). Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:4495-4508. Available from https://proceedings.mlr.press/v270/wang25l.html.
