Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Proceedings of The 7th Conference on Robot Learning, PMLR 229:2210-2228, 2023.
Abstract
Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial-resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal-resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in 3 domains (coarse, precise and dynamic manipulation tasks), we show that our approach significantly improves ($2\times$ on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
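To make the architecture described in the abstract concrete, the following is a minimal sketch (not the authors' released code) of the multi-resolution idea: a large frozen encoder standing in for the pretrained vision-language model refreshes global features at a low rate, while a small trainable network consumes high-frequency local feedback at every control step. All module names, dimensions, and update rates here are illustrative assumptions.

```python
# Illustrative sketch only; the actual paper's architecture and rates differ.
import torch
import torch.nn as nn

class MultiResolutionPolicy(nn.Module):
    def __init__(self, slow_dim=512, fast_dim=32, action_dim=7, slow_every=10):
        super().__init__()
        # Stand-in for a frozen pretrained vision-language encoder
        # (e.g. a CLIP-style model); an MLP keeps the sketch self-contained.
        self.slow_encoder = nn.Sequential(
            nn.Linear(3 * 64 * 64, slow_dim), nn.ReLU(),
            nn.Linear(slow_dim, slow_dim),
        )
        for p in self.slow_encoder.parameters():
            p.requires_grad = False  # pretrained features stay frozen
        # Small, trainable network for high-frequency local feedback
        # (e.g. proprioception or a wrist-camera crop).
        self.fast_encoder = nn.Sequential(
            nn.Linear(fast_dim, 64), nn.ReLU(), nn.Linear(64, 64),
        )
        self.head = nn.Linear(slow_dim + 64, action_dim)
        self.slow_every = slow_every  # slow features refresh every N steps
        self._cached_slow = None      # reused between slow updates

    def forward(self, global_image, local_obs, step):
        # Recompute the expensive global features only at the low rate;
        # in between, reuse the cached embedding for real-time control.
        if self._cached_slow is None or step % self.slow_every == 0:
            with torch.no_grad():
                self._cached_slow = self.slow_encoder(global_image.flatten(1))
        fast = self.fast_encoder(local_obs)
        return self.head(torch.cat([self._cached_slow, fast], dim=-1))

policy = MultiResolutionPolicy()
for t in range(20):  # toy control loop running at the fast rate
    img = torch.randn(1, 3, 64, 64)  # low-frequency global view
    obs = torch.randn(1, 32)         # high-frequency local feedback
    action = policy(img, obs, step=t)
```

The caching of the slow embedding is what lets the expensive model run well below the control frequency: only the small fast pathway and the action head execute at every step, which is the property the abstract attributes to using networks of varying capacities.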