Semantic Mechanical Search with Large Vision and Language Models
Proceedings of The 7th Conference on Robot Learning, PMLR 229:971-1005, 2023.
Abstract
Moving objects to find a fully-occluded target object, known as mechanical search, is a challenging problem in robotics. As objects are often organized semantically, we conjecture that semantic information about object relationships can facilitate mechanical search and reduce search time. Large pretrained vision and language models (VLMs and LLMs) have shown promise in generalizing to uncommon objects and previously unseen real-world environments. In this work, we propose a novel framework called Semantic Mechanical Search (SMS). SMS conducts scene understanding and explicitly generates a semantic occupancy distribution using LLMs. Compared to methods that rely on the visual similarities offered by CLIP embeddings, SMS leverages the deeper reasoning capabilities of LLMs. Unlike prior work that uses VLMs and LLMs as end-to-end planners, which may not integrate well with specialized geometric planners, SMS can serve as a plug-in semantic module for downstream manipulation or navigation policies. For mechanical search in closed-world settings such as shelves, we compare with a geometric-based planner and show that SMS improves mechanical search performance by 24% across the pharmacy, kitchen, and office domains in simulation and by 47.1% in physical experiments. For open-world real environments, SMS produces better semantic distributions than CLIP-based methods, with the potential to be integrated with downstream navigation policies to improve object navigation tasks. Code, data, videos, and the appendix are available here.
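
To make the idea of an explicit semantic occupancy distribution concrete, below is a minimal Python sketch, assuming hypothetical relatedness scores obtained by prompting an LLM about the hidden target object; the function name, example prompt, and score values are illustrative assumptions, not the paper's implementation. It normalizes the scores into a distribution over visible occluders that a downstream geometric planner or navigation policy could consume.

from typing import Dict, List

def semantic_occupancy_distribution(
    visible_objects: List[str],
    relatedness_to_target: Dict[str, float],
) -> Dict[str, float]:
    """Turn per-object LLM relatedness scores (how likely each visible
    object is to be stored near the hidden target) into a normalized
    distribution over which object to move or look behind first."""
    total = sum(relatedness_to_target.get(obj, 0.0) for obj in visible_objects)
    if total == 0.0:
        # No semantic signal from the LLM: fall back to a uniform prior.
        return {obj: 1.0 / len(visible_objects) for obj in visible_objects}
    return {obj: relatedness_to_target.get(obj, 0.0) / total for obj in visible_objects}

# Hypothetical usage: the scores would come from prompting an LLM, e.g.
# "On a 0-1 scale, how likely is ibuprofen to be stored next to bandages?"
scores = {"bandages": 0.8, "cereal box": 0.1, "stapler": 0.05}
print(semantic_occupancy_distribution(list(scores.keys()), scores))
# -> roughly {'bandages': 0.84, 'cereal box': 0.11, 'stapler': 0.05}

In the framework described above, such a distribution would be passed to a specialized downstream planner rather than having the LLM plan end-to-end; the actual prompting and integration details are given in the paper and appendix.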