See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation

Hao Li, Yizhi Zhang, Junzhe Zhu, Shaoxiong Wang, Michelle A Lee, Huazhe Xu, Edward Adelson, Li Fei-Fei, Ruohan Gao, Jiajun Wu
Proceedings of The 6th Conference on Robot Learning, PMLR 205:1368-1378, 2023.

Abstract

Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused with a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision displays the global status of the robot but can often suffer from occlusion, audio provides immediate feedback of key moments that are even invisible, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
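The abstract states only that the three modalities are "fused with a self-attention model," without specifying the architecture. As an illustrative sketch (not the authors' implementation), the following NumPy snippet shows single-head scaled dot-product self-attention over three modality tokens, with random weights standing in for the learned encoders and projection matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension (illustrative choice)

# Hypothetical per-modality features, assumed already encoded to a common dimension:
vision = rng.standard_normal(d)
audio = rng.standard_normal(d)
touch = rng.standard_normal(d)
tokens = np.stack([vision, audio, touch])  # (3, d): one token per modality

# Random projections stand in for learned query/key/value weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over modality tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[-1])       # (3, 3) cross-modal weights
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ v                               # each output mixes all modalities

fused = self_attention(tokens, Wq, Wk, Wv)        # (3, d) fused features
```

Each output token is a learned weighted mixture of all three inputs, so e.g. the touch token can be reweighted by what the camera and microphone report; pooling the fused tokens (e.g. a mean) would yield a single state vector for downstream policy learning.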

Cite this Paper


BibTeX
@InProceedings{pmlr-v205-li23c,
  title     = {See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation},
  author    = {Li, Hao and Zhang, Yizhi and Zhu, Junzhe and Wang, Shaoxiong and Lee, Michelle A and Xu, Huazhe and Adelson, Edward and Fei-Fei, Li and Gao, Ruohan and Wu, Jiajun},
  booktitle = {Proceedings of The 6th Conference on Robot Learning},
  pages     = {1368--1378},
  year      = {2023},
  editor    = {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume    = {205},
  series    = {Proceedings of Machine Learning Research},
  month     = {14--18 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v205/li23c/li23c.pdf},
  url       = {https://proceedings.mlr.press/v205/li23c.html},
  abstract  = {Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused with a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision displays the global status of the robot but can often suffer from occlusion, audio provides immediate feedback of key moments that are even invisible, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.}
}
Endnote
%0 Conference Paper
%T See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation
%A Hao Li
%A Yizhi Zhang
%A Junzhe Zhu
%A Shaoxiong Wang
%A Michelle A Lee
%A Huazhe Xu
%A Edward Adelson
%A Li Fei-Fei
%A Ruohan Gao
%A Jiajun Wu
%B Proceedings of The 6th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Karen Liu
%E Dana Kulic
%E Jeff Ichnowski
%F pmlr-v205-li23c
%I PMLR
%P 1368--1378
%U https://proceedings.mlr.press/v205/li23c.html
%V 205
%X Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused with a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision displays the global status of the robot but can often suffer from occlusion, audio provides immediate feedback of key moments that are even invisible, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
APA
Li, H., Zhang, Y., Zhu, J., Wang, S., Lee, M.A., Xu, H., Adelson, E., Fei-Fei, L., Gao, R. & Wu, J. (2023). See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:1368-1378. Available from https://proceedings.mlr.press/v205/li23c.html.