Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation

Ruoxuan Feng, Di Hu, Wenke Ma, Xuelong Li
Proceedings of The 8th Conference on Robot Learning, PMLR 270:340-363, 2025.

Abstract

Humans possess a remarkable talent for flexibly switching between different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to the colors, sounds, and aromas, seamlessly navigating through every stage of the complex cooking process. This ability rests on a thorough understanding of task stages, since achieving the sub-goal of each stage can require different senses. To endow robots with a similar ability, we incorporate task stages, divided by sub-goals, into the imitation learning process to guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with a keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-feng25a,
  title = {Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation},
  author = {Feng, Ruoxuan and Hu, Di and Ma, Wenke and Li, Xuelong},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages = {340--363},
  year = {2025},
  editor = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume = {270},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Nov},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/feng25a/feng25a.pdf},
  url = {https://proceedings.mlr.press/v270/feng25a.html},
  abstract = {Humans possess a remarkable talent for flexibly alternating to different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to the colors, sounds, and aromas, seamlessly navigating through every stage of the complex cooking process. This ability is founded upon a thorough comprehension of task stages, as achieving the sub-goal within each stage can necessitate the utilization of different senses. In order to endow robots with similar ability, we incorporate the task stages divided by sub-goals into the imitation learning process to accordingly guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.}
}
Endnote
%0 Conference Paper
%T Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation
%A Ruoxuan Feng
%A Di Hu
%A Wenke Ma
%A Xuelong Li
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-feng25a
%I PMLR
%P 340--363
%U https://proceedings.mlr.press/v270/feng25a.html
%V 270
%X Humans possess a remarkable talent for flexibly alternating to different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to the colors, sounds, and aromas, seamlessly navigating through every stage of the complex cooking process. This ability is founded upon a thorough comprehension of task stages, as achieving the sub-goal within each stage can necessitate the utilization of different senses. In order to endow robots with similar ability, we incorporate the task stages divided by sub-goals into the imitation learning process to accordingly guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.
APA
Feng, R., Hu, D., Ma, W., & Li, X. (2025). Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:340-363. Available from https://proceedings.mlr.press/v270/feng25a.html.
