3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing

Binghao Huang, Yixuan Wang, Xinyi Yang, Yiyue Luo, Yunzhu Li
Proceedings of The 8th Conference on Robot Learning, PMLR 270:2557-2578, 2025.

Abstract

Tactile and visual perception are both crucial for humans to perform fine-grained interactions with their environment. Developing similar multi-modal sensing capabilities for robots can significantly enhance and expand their manipulation skills. This paper introduces 3D-ViTac, a multi-modal sensing and learning system designed for dexterous bimanual manipulation. Our system features tactile sensors equipped with dense sensing units, each covering an area of 3 mm². These sensors are low-cost and flexible, providing detailed and extensive coverage of physical contacts, effectively complementing visual information. To integrate tactile and visual data, we fuse them into a unified 3D representation space that preserves their 3D structures and spatial relationships. The multi-modal representation can then be coupled with diffusion policies for imitation learning. Through concrete hardware experiments, we demonstrate that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies, particularly in safely interacting with fragile items and executing long-horizon tasks involving in-hand manipulation. Our project page is available at https://binghao-huang.github.io/3D-ViTac/.
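
The abstract describes fusing per-taxel tactile readings with visual observations into a single 3D representation that preserves spatial relationships. The sketch below illustrates one way such a fusion could look: tactile pads are projected into world-frame points carrying a pressure feature and merged with a camera point cloud before being passed to a point-cloud encoder. The pad geometry, grid size, function names, and feature layout are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): project tactile pad readings
# into 3D points and fuse them with a visual point cloud.
import numpy as np

def tactile_pad_to_points(pressure, pad_pose, pitch_mm=1.7):
    """Convert a 2D tactile pressure image into world-frame 3D points.

    pressure : (H, W) array of per-taxel readings.
    pad_pose : (4, 4) homogeneous transform of the pad (e.g. from forward kinematics).
    pitch_mm : assumed center-to-center taxel spacing (illustrative value).
    """
    h, w = pressure.shape
    # Taxel centers on a regular grid in the pad's local frame (mm -> m).
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    local = np.stack([xs, ys, np.zeros_like(xs)], axis=-1).reshape(-1, 3) * (pitch_mm / 1000.0)
    # Transform taxel centers into the world frame.
    world = (pad_pose[:3, :3] @ local.T).T + pad_pose[:3, 3]
    # Attach the pressure reading as a per-point feature channel.
    feats = pressure.reshape(-1, 1)
    return np.concatenate([world, feats], axis=1)  # (H*W, 4)

def fuse_visuo_tactile(visual_points, tactile_clouds):
    """Merge camera points (xyz, zero pressure) with tactile points (xyz, pressure)."""
    visual = np.concatenate([visual_points, np.zeros((len(visual_points), 1))], axis=1)
    return np.concatenate([visual] + tactile_clouds, axis=0)

# Usage: one 16x16 pad plus a depth-camera point cloud.
pad_pose = np.eye(4)
tactile = tactile_pad_to_points(np.random.rand(16, 16), pad_pose)
fused = fuse_visuo_tactile(np.random.rand(2048, 3), [tactile])
print(fused.shape)  # (2304, 4) -> input to a point-cloud encoder / diffusion policy
```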

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-huang25e,
  title     = {3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing},
  author    = {Huang, Binghao and Wang, Yixuan and Yang, Xinyi and Luo, Yiyue and Li, Yunzhu},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {2557--2578},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/huang25e/huang25e.pdf},
  url       = {https://proceedings.mlr.press/v270/huang25e.html},
  abstract  = {Tactile and visual perception are both crucial for humans to perform fine-grained interactions with their environment. Developing similar multi-modal sensing capabilities for robots can significantly enhance and expand their manipulation skills. This paper introduces 3D-ViTac, a multi-modal sensing and learning system designed for dexterous bimanual manipulation. Our system features tactile sensors equipped with dense sensing units, each covering an area of 3$mm^2$. These sensors are low-cost and flexible, providing detailed and extensive coverage of physical contacts, effectively complementing visual information. To integrate tactile and visual data, we fuse them into a unified 3D representation space that preserves their 3D structures and spatial relationships. The multi-modal representation can then be coupled with diffusion policies for imitation learning. Through concrete hardware experiments, we demonstrate that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies, particularly in safe interactions with fragile items and executing long-horizon tasks involving in-hand manipulation. Our project page is available at https://binghao-huang.github.io/3D-ViTac/.}
}
Endnote
%0 Conference Paper
%T 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing
%A Binghao Huang
%A Yixuan Wang
%A Xinyi Yang
%A Yiyue Luo
%A Yunzhu Li
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-huang25e
%I PMLR
%P 2557--2578
%U https://proceedings.mlr.press/v270/huang25e.html
%V 270
%X Tactile and visual perception are both crucial for humans to perform fine-grained interactions with their environment. Developing similar multi-modal sensing capabilities for robots can significantly enhance and expand their manipulation skills. This paper introduces 3D-ViTac, a multi-modal sensing and learning system designed for dexterous bimanual manipulation. Our system features tactile sensors equipped with dense sensing units, each covering an area of 3$mm^2$. These sensors are low-cost and flexible, providing detailed and extensive coverage of physical contacts, effectively complementing visual information. To integrate tactile and visual data, we fuse them into a unified 3D representation space that preserves their 3D structures and spatial relationships. The multi-modal representation can then be coupled with diffusion policies for imitation learning. Through concrete hardware experiments, we demonstrate that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies, particularly in safe interactions with fragile items and executing long-horizon tasks involving in-hand manipulation. Our project page is available at https://binghao-huang.github.io/3D-ViTac/.
APA
Huang, B., Wang, Y., Yang, X., Luo, Y. & Li, Y. (2025). 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:2557-2578. Available from https://proceedings.mlr.press/v270/huang25e.html.