That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation

Abitha Thankaraj, Lerrel Pinto
Proceedings of The 7th Conference on Robot Learning, PMLR 229:1036-1049, 2023.

Abstract

Learning to produce contact-rich, dynamic behaviors from raw sensory data has been a longstanding challenge in robotics. Prominent approaches primarily focus on using visual and tactile sensing. However, pure vision often fails to capture high-frequency interaction, while current tactile sensors can be too delicate for large-scale data collection. In this work, we propose a data-centric approach to dynamic manipulation that uses an often ignored source of information – sound. We first collect a dataset of 25k interaction-sound pairs across five dynamic tasks using contact microphones. Then, given this data, we leverage self-supervised learning to accelerate behavior prediction from sound. Our experiments indicate that this self-supervised ‘pretraining’ is crucial to achieving high performance, with a 34.5% lower MSE than plain supervised learning and a 54.3% lower MSE over visual training. Importantly, we find that when asked to generate desired sound profiles, online rollouts of our models on a UR10 robot can produce dynamic behavior that achieves an average of 11.5% improvement over supervised learning on audio similarity metrics. Videos and audio data are best seen on our project website: aurl-anon.github.io

Cite this Paper


BibTeX
@InProceedings{pmlr-v229-thankaraj23a,
  title = {That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation},
  author = {Thankaraj, Abitha and Pinto, Lerrel},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages = {1036--1049},
  year = {2023},
  editor = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume = {229},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Nov},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v229/thankaraj23a/thankaraj23a.pdf},
  url = {https://proceedings.mlr.press/v229/thankaraj23a.html},
  abstract = {Learning to produce contact-rich, dynamic behaviors from raw sensory data has been a longstanding challenge in robotics. Prominent approaches primarily focus on using visual and tactile sensing. However, pure vision often fails to capture high-frequency interaction, while current tactile sensors can be too delicate for large-scale data collection. In this work, we propose a data-centric approach to dynamic manipulation that uses an often ignored source of information – sound. We first collect a dataset of 25k interaction-sound pairs across five dynamic tasks using contact microphones. Then, given this data, we leverage self-supervised learning to accelerate behavior prediction from sound. Our experiments indicate that this self-supervised ‘pretraining’ is crucial to achieving high performance, with a $34.5\%$ lower MSE than plain supervised learning and a $54.3\%$ lower MSE over visual training. Importantly, we find that when asked to generate desired sound profiles, online rollouts of our models on a UR10 robot can produce dynamic behavior that achieves an average of $11.5\%$ improvement over supervised learning on audio similarity metrics. Videos and audio data are best seen on our project website: aurl-anon.github.io}
}
Endnote
%0 Conference Paper
%T That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation
%A Abitha Thankaraj
%A Lerrel Pinto
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-thankaraj23a
%I PMLR
%P 1036--1049
%U https://proceedings.mlr.press/v229/thankaraj23a.html
%V 229
%X Learning to produce contact-rich, dynamic behaviors from raw sensory data has been a longstanding challenge in robotics. Prominent approaches primarily focus on using visual and tactile sensing. However, pure vision often fails to capture high-frequency interaction, while current tactile sensors can be too delicate for large-scale data collection. In this work, we propose a data-centric approach to dynamic manipulation that uses an often ignored source of information – sound. We first collect a dataset of 25k interaction-sound pairs across five dynamic tasks using contact microphones. Then, given this data, we leverage self-supervised learning to accelerate behavior prediction from sound. Our experiments indicate that this self-supervised ‘pretraining’ is crucial to achieving high performance, with a 34.5% lower MSE than plain supervised learning and a 54.3% lower MSE over visual training. Importantly, we find that when asked to generate desired sound profiles, online rollouts of our models on a UR10 robot can produce dynamic behavior that achieves an average of 11.5% improvement over supervised learning on audio similarity metrics. Videos and audio data are best seen on our project website: aurl-anon.github.io
APA
Thankaraj, A. & Pinto, L. (2023). That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:1036-1049. Available from https://proceedings.mlr.press/v229/thankaraj23a.html.
