LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

Dhruv Shah; Błażej Osiński; brian ichter; Sergey Levine

Proceedings of The 6th Conference on Robot Learning, PMLR 205:492-504, 2023.

Abstract

Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on large, unannotated datasets of trajectories, while still providing a high-level interface to the user. Instead of utilizing a labeled instruction-following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. LM-Nav extracts landmark names from an instruction, grounds them in the world via the image-language model, and then reaches them via the (vision-only) navigation model. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex, outdoor environments from natural language instructions.
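The three stages named in the abstract (extract landmark names, ground them, reach them) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: substring matching replaces the language model, word overlap with a text description replaces the image-language model, and breadth-first search over a small hand-written graph replaces the navigation model. The names `NODES`, `EDGES`, `VOCABULARY`, and all functions are assumptions made for illustration.

```python
from collections import deque

# Hypothetical map: each node carries a text description that stands in
# for the image the robot would have observed at that location.
NODES = {
    0: "grass field",
    1: "red stop sign",
    2: "wooden picnic table",
    3: "white building entrance",
}
# Hypothetical connectivity between nodes (assumed connected).
EDGES = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
VOCABULARY = ["stop sign", "picnic table", "white building"]


def extract_landmarks(instruction, vocabulary):
    """Stand-in for the language model: known landmark phrases, in order of mention."""
    text = instruction.lower()
    found = [(text.find(name), name) for name in vocabulary if name in text]
    return [name for _, name in sorted(found)]


def ground(landmark, nodes):
    """Stand-in for the image-language model: node whose description shares the most words."""
    words = set(landmark.split())
    return max(nodes, key=lambda n: len(words & set(nodes[n].split())))


def shortest_path(edges, src, dst):
    """Stand-in for the navigation model: breadth-first search on the toy graph."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in edges[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def follow_instruction(instruction, start):
    """Chain the three stages: visit each grounded landmark in the order mentioned."""
    route = [start]
    for landmark in extract_landmarks(instruction, VOCABULARY):
        goal = ground(landmark, NODES)
        route += shortest_path(EDGES, route[-1], goal)[1:]
    return route


route = follow_instruction("Go past the stop sign, then head to the white building.", 0)
print(route)
```

The sketch only shows how the stages compose: a real system would swap each stand-in for the corresponding pre-trained model and operate on camera images rather than text descriptions.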

Cite this Paper

BibTeX


@InProceedings{pmlr-v205-shah23b,
  title = 	 {LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action},
  author =       {Shah, Dhruv and Osi\'nski, B\l{a}\.zej and ichter, brian and Levine, Sergey},
  booktitle = 	 {Proceedings of The 6th Conference on Robot Learning},
  pages = 	 {492--504},
  year = 	 {2023},
  editor = 	 {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume = 	 {205},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--18 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v205/shah23b/shah23b.pdf},
  url = 	 {https://proceedings.mlr.press/v205/shah23b.html},
  abstract = 	 {Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on large, unannotated datasets of trajectories, while still providing a high-level interface to the user. Instead of utilizing a labeled instruction-following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. LM-Nav extracts landmark names from an instruction, grounds them in the world via the image-language model, and then reaches them via the (vision-only) navigation model. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex, outdoor environments from natural language instructions.}
}

Endnote

%0 Conference Paper
%T LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
%A Dhruv Shah
%A Błażej Osiński
%A brian ichter
%A Sergey Levine
%B Proceedings of The 6th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Karen Liu
%E Dana Kulic
%E Jeff Ichnowski
%F pmlr-v205-shah23b
%I PMLR
%P 492--504
%U https://proceedings.mlr.press/v205/shah23b.html
%V 205
%X Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on large, unannotated datasets of trajectories, while still providing a high-level interface to the user. Instead of utilizing a labeled instruction-following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. LM-Nav extracts landmark names from an instruction, grounds them in the world via the image-language model, and then reaches them via the (vision-only) navigation model. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex, outdoor environments from natural language instructions.

APA


Shah, D., Osiński, B., ichter, b. & Levine, S. (2023). LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:492-504. Available from https://proceedings.mlr.press/v205/shah23b.html.
