Multi-Resolution Sensing for Real-Time Control with Vision-Language Models

Saumya Saxena, Mohit Sharma, Oliver Kroemer
Proceedings of The 7th Conference on Robot Learning, PMLR 229:2210-2228, 2023.

Abstract

Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial-resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal-resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework for learning generalizable, language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in three domains (coarse, precise, and dynamic manipulation tasks), we show that our approach significantly improves ($2\times$ on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
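
To make the multi-rate idea in the abstract concrete, below is a minimal sketch of such a control loop. This is not the authors' released implementation; all module names, dimensions, and update rates are illustrative assumptions. A large, frozen vision-language encoder refreshes global features at a low rate, while a small trained-from-scratch head consumes the cached global features together with high-frequency local feedback at every control step.

    # Minimal sketch (illustrative only, not the paper's code) of a
    # multi-resolution control loop: a slow pretrained VLM pathway plus
    # a fast small network on high-frequency local feedback.
    import torch
    import torch.nn as nn

    class SlowVLMPath(nn.Module):
        """Stand-in for a frozen pretrained VLM encoder (global, low rate)."""
        def __init__(self, feat_dim=256):
            super().__init__()
            # Placeholder for e.g. a CLIP-style image+language encoder.
            self.encoder = nn.Linear(3 * 224 * 224, feat_dim)

        @torch.no_grad()  # frozen: no gradients through the slow path
        def forward(self, global_image):
            return self.encoder(global_image.flatten(1))

    class FastLocalPolicy(nn.Module):
        """Small non-pretrained head run at every control step."""
        def __init__(self, feat_dim=256, local_dim=32, act_dim=7):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim + local_dim, 128), nn.ReLU(),
                nn.Linear(128, act_dim),
            )

        def forward(self, slow_feat, local_obs):
            return self.mlp(torch.cat([slow_feat, local_obs], dim=-1))

    slow, fast = SlowVLMPath(), FastLocalPolicy()
    slow_feat = None
    SLOW_EVERY = 10  # assume the VLM runs at ~10x lower rate than control

    for t in range(100):                              # high-frequency loop
        global_image = torch.randn(1, 3 * 224 * 224)  # dummy camera frame
        local_obs = torch.randn(1, 32)                # dummy local feedback
        if t % SLOW_EVERY == 0 or slow_feat is None:
            slow_feat = slow(global_image)            # slow global refresh
        action = fast(slow_feat, local_obs)           # react at full rate

The essential design choice is that the expensive encoder sits off the critical path: the fast head keeps reacting to local feedback between slow-feature updates, which is what enables real-time control of reactive tasks.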

Cite this Paper

BibTeX
@InProceedings{pmlr-v229-saxena23a,
  title     = {Multi-Resolution Sensing for Real-Time Control with Vision-Language Models},
  author    = {Saxena, Saumya and Sharma, Mohit and Kroemer, Oliver},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {2210--2228},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/saxena23a/saxena23a.pdf},
  url       = {https://proceedings.mlr.press/v229/saxena23a.html}
}
Endnote
%0 Conference Paper
%T Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
%A Saumya Saxena
%A Mohit Sharma
%A Oliver Kroemer
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-saxena23a
%I PMLR
%P 2210--2228
%U https://proceedings.mlr.press/v229/saxena23a.html
%V 229
APA
Saxena, S., Sharma, M. & Kroemer, O. (2023). Multi-Resolution Sensing for Real-Time Control with Vision-Language Models. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:2210-2228. Available from https://proceedings.mlr.press/v229/saxena23a.html.
