CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications

Jan Blumenkamp; Steven Morad; Jennifer Gielis; Amanda Prorok

Proceedings of The 8th Conference on Robot Learning, PMLR 270:3780-3808, 2025.

Abstract

Autonomous robot operation in unstructured environments is often underpinned by spatial understanding through vision. Systems composed of multiple concurrently operating robots additionally require access to frequent, accurate and reliable pose estimates. Classical vision-based methods to regress relative pose are commonly computationally expensive (precluding real-time applications), and often lack data-derived priors for resolving ambiguities. In this work, we propose CoViS-Net, a cooperative, multi-robot visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as general spatial comprehension. Our model is fully decentralized, platform-agnostic, executable in real-time using onboard compute, and does not require existing networking infrastructure. CoViS-Net provides relative pose estimates and a local bird’s-eye-view (BEV) representation, even without camera overlap between robots, and can predict BEV representations of unseen regions. We demonstrate its use in a multi-robot formation control task across various real-world settings. We provide supplementary material online and will open source our trained model in due course. https://sites.google.com/view/covis-net

Cite this Paper

BibTeX

@InProceedings{pmlr-v270-blumenkamp25a,
  title = {CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications},
  author = {Blumenkamp, Jan and Morad, Steven and Gielis, Jennifer and Prorok, Amanda},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages = {3780--3808},
  year = {2025},
  editor = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume = {270},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Nov},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/blumenkamp25a/blumenkamp25a.pdf},
  url = {https://proceedings.mlr.press/v270/blumenkamp25a.html},
  abstract = {Autonomous robot operation in unstructured environments is often underpinned by spatial understanding through vision. Systems composed of multiple concurrently operating robots additionally require access to frequent, accurate and reliable pose estimates. Classical vision-based methods to regress relative pose are commonly computationally expensive (precluding real-time applications), and often lack data-derived priors for resolving ambiguities. In this work, we propose CoViS-Net, a cooperative, multi-robot visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as general spatial comprehension. Our model is fully decentralized, platform-agnostic, executable in real-time using onboard compute, and does not require existing networking infrastructure. CoViS-Net provides relative pose estimates and a local bird’s-eye-view (BEV) representation, even without camera overlap between robots, and can predict BEV representations of unseen regions. We demonstrate its use in a multi-robot formation control task across various real-world settings. We provide supplementary material online and will open source our trained model in due course. https://sites.google.com/view/covis-net}
}

Endnote

%0 Conference Paper
%T CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications
%A Jan Blumenkamp
%A Steven Morad
%A Jennifer Gielis
%A Amanda Prorok
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-blumenkamp25a
%I PMLR
%P 3780--3808
%U https://proceedings.mlr.press/v270/blumenkamp25a.html
%V 270
%X Autonomous robot operation in unstructured environments is often underpinned by spatial understanding through vision. Systems composed of multiple concurrently operating robots additionally require access to frequent, accurate and reliable pose estimates. Classical vision-based methods to regress relative pose are commonly computationally expensive (precluding real-time applications), and often lack data-derived priors for resolving ambiguities. In this work, we propose CoViS-Net, a cooperative, multi-robot visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as general spatial comprehension. Our model is fully decentralized, platform-agnostic, executable in real-time using onboard compute, and does not require existing networking infrastructure. CoViS-Net provides relative pose estimates and a local bird’s-eye-view (BEV) representation, even without camera overlap between robots, and can predict BEV representations of unseen regions. We demonstrate its use in a multi-robot formation control task across various real-world settings. We provide supplementary material online and will open source our trained model in due course. https://sites.google.com/view/covis-net

APA

Blumenkamp, J., Morad, S., Gielis, J. & Prorok, A. (2025). CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:3780-3808. Available from https://proceedings.mlr.press/v270/blumenkamp25a.html.
