A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots

Peixin Chang, Shuijing Liu, Tianchen Ji, Neeloy Chakraborty, Kaiwen Hong, Katherine Rose Driggs-Campbell
Proceedings of The 7th Conference on Robot Learning, PMLR 229:1797-1819, 2023.

Abstract

A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, instead of engineers. Previous methods are either difficult to continuously improve after deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can continually self-improve in previously unseen scenarios with fewer new labeled examples, while still achieving better performance than previous methods.
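The abstract describes learning a joint visual-audio representation with a contrastive objective and using it as an intrinsic reward for command-following tasks. The sketch below (not the authors' released code; encoder architectures, dimensions, and the reward form are illustrative assumptions) shows one way such a pipeline could look: an image encoder and a sound-spectrogram encoder trained with a symmetric InfoNCE loss on matched image-command pairs, with image-command cosine similarity serving as the reward signal for a downstream RL policy.

```python
# Minimal sketch, assuming PyTorch; architectures and the reward form are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        return F.normalize(self.net(img), dim=-1)


class AudioEncoder(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, spec):                     # spec: (B, 1, F, T) spectrogram
        return F.normalize(self.net(spec), dim=-1)


def contrastive_loss(img_emb, aud_emb, temperature: float = 0.07):
    """Symmetric InfoNCE: matched image/sound pairs are positives, all others negatives."""
    logits = img_emb @ aud_emb.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def intrinsic_reward(img_enc, aud_enc, obs_img, command_spec):
    """Reward observations whose embedding aligns with the spoken command's embedding."""
    with torch.no_grad():
        return (img_enc(obs_img) * aud_enc(command_spec)).sum(-1)  # cosine similarity
```

Under this framing, fine-tuning in a new deployment domain would amount to collecting a small set of new image-sound pairs and continuing to minimize `contrastive_loss`, without designing any task-specific reward by hand, which is consistent with the data-efficient, non-expert updating the abstract emphasizes.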

Cite this Paper


BibTeX
@InProceedings{pmlr-v229-chang23a,
  title     = {A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots},
  author    = {Chang, Peixin and Liu, Shuijing and Ji, Tianchen and Chakraborty, Neeloy and Hong, Kaiwen and Driggs-Campbell, Katherine Rose},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {1797--1819},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/chang23a/chang23a.pdf},
  url       = {https://proceedings.mlr.press/v229/chang23a.html},
  abstract  = {A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, instead of engineers. Previous methods are either difficult to continuously improve after the deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can continually self-improve in previously unseen scenarios given fewer new labeled data, while still achieving better performance over previous methods.}
}
Endnote
%0 Conference Paper
%T A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots
%A Peixin Chang
%A Shuijing Liu
%A Tianchen Ji
%A Neeloy Chakraborty
%A Kaiwen Hong
%A Katherine Rose Driggs-Campbell
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-chang23a
%I PMLR
%P 1797--1819
%U https://proceedings.mlr.press/v229/chang23a.html
%V 229
%X A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, instead of engineers. Previous methods are either difficult to continuously improve after the deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can continually self-improve in previously unseen scenarios given fewer new labeled data, while still achieving better performance over previous methods.
APA
Chang, P., Liu, S., Ji, T., Chakraborty, N., Hong, K. & Driggs-Campbell, K.R. (2023). A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:1797-1819. Available from https://proceedings.mlr.press/v229/chang23a.html.

Related Material