Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration

Jennifer Grannen, Siddharth Karamcheti, Suvir Mirchandani, Percy Liang, Dorsa Sadigh
Proceedings of The 8th Conference on Robot Learning, PMLR 270:1-24, 2025.

Abstract

We introduce Vocal Sandbox, a framework for enabling seamless human-robot collaboration in situated environments. Systems in our framework are characterized by their ability to *adapt and continually learn* at multiple levels of abstraction from diverse teaching modalities such as spoken dialogue, object keypoints, and kinesthetic demonstrations. To enable such adaptation, we design lightweight and interpretable learning algorithms that allow users to build an understanding and co-adapt to a robot’s capabilities in real-time, as they teach new behaviors. For example, after demonstrating a new low-level skill for “tracking around” an object, users are provided with trajectory visualizations of the robot’s intended motion when asked to track a new object. Similarly, users teach high-level planning behaviors through spoken dialogue, using pretrained language models to synthesize behaviors such as “packing an object away” as compositions of low-level skills – concepts that can be reused and built upon. We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation. In the first setting, we run systematic ablations and user studies with 8 non-expert participants, highlighting the impact of multi-level teaching. Across 23 hours of total robot interaction time, users teach 17 new high-level behaviors with an average of 16 novel low-level skills, requiring 22.1% less active supervision compared to baselines. Qualitatively, users strongly prefer Vocal Sandbox systems due to their ease of use (+31.2%), helpfulness (+13.0%), and overall performance (+18.2%). Finally, we pair an experienced system-user with a robot to film a stop-motion animation; over two hours of continuous collaboration, the user teaches progressively more complex motion skills to produce a 52 second (232 frame) movie. Videos & Supplementary Material: https://vocal-sandbox.github.io
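To make the planning layer described above concrete, the following is a minimal sketch of how a pretrained language model could synthesize a new high-level behavior (e.g., "pack the candle away") as a composition of previously taught low-level skills. All names here (SkillLibrary, synthesize_behavior, llm_complete) are hypothetical illustrations, not the paper's actual implementation; the LLM call is left as a user-supplied function.

# Hypothetical sketch of multi-level teaching: a registry of taught low-level
# skills, plus an LLM-based planner that composes them into new behaviors.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SkillLibrary:
    """Registry of low-level skills the robot currently knows."""
    skills: Dict[str, Callable[..., None]] = field(default_factory=dict)

    def add_skill(self, name: str, fn: Callable[..., None]) -> None:
        # e.g., a skill distilled from a kinesthetic demonstration or object keypoints
        self.skills[name] = fn

    def prompt_header(self) -> str:
        # Expose the current skill set to the language-model planner
        return "Available skills: " + ", ".join(sorted(self.skills))


def synthesize_behavior(library: SkillLibrary, instruction: str,
                        llm_complete: Callable[[str], str]) -> List[str]:
    """Ask a pretrained LLM to express a spoken instruction as a sequence of
    known low-level skill calls, one per line."""
    prompt = (
        library.prompt_header()
        + f"\nUser: {instruction}"
        + "\nRespond with one skill call per line."
    )
    plan = llm_complete(prompt)
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    # Keep only steps that reference skills the robot has already been taught,
    # so the user can choose to demonstrate any missing skill instead.
    return [s for s in steps if s.split("(")[0] in library.skills]

In this sketch, a synthesized behavior can itself be registered back into the library under a new name, which is one way the "concepts that can be reused and built upon" idea could be realized.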

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-grannen25a,
  title     = {Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration},
  author    = {Grannen, Jennifer and Karamcheti, Siddharth and Mirchandani, Suvir and Liang, Percy and Sadigh, Dorsa},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {1--24},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/grannen25a/grannen25a.pdf},
  url       = {https://proceedings.mlr.press/v270/grannen25a.html}
}
APA
Grannen, J., Karamcheti, S., Mirchandani, S., Liang, P. & Sadigh, D. (2025). Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:1-24. Available from https://proceedings.mlr.press/v270/grannen25a.html.
