SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation

Yi Wei; Linqing Zhao; Wenzhao Zheng; Zheng Zhu; Yongming Rao; Guan Huang; Jiwen Lu; Jie Zhou

Proceedings of The 6th Conference on Robot Learning, PMLR 205:539-549, 2023.

Abstract

Depth estimation from images is a fundamental step of 3D perception for autonomous driving and an economical alternative to expensive depth sensors such as LiDAR. Temporal photometric consistency enables self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict depth from each monocular image alone and ignore the correlations among the multiple surrounding cameras that are typically available on modern self-driving vehicles. In this paper, we propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views. We apply cross-view self-attention to efficiently enable global interactions between multi-camera feature maps. Unlike self-supervised monocular depth estimation, we are able to predict real-world scales given multi-camera extrinsic matrices. To achieve this goal, we adopt two-frame structure-from-motion to extract scale-aware pseudo depths to pretrain the models. Further, instead of predicting the ego-motion of each individual camera, we estimate a universal ego-motion of the vehicle and transfer it to each view to achieve multi-view consistency. In experiments, our method achieves state-of-the-art performance on the challenging multi-camera depth estimation datasets DDAD and nuScenes.
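
The ego-motion transfer mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it only shows the standard rigid-body change of frame that such a transfer could rely on. It assumes each camera extrinsic is a 4x4 matrix mapping camera coordinates to vehicle coordinates and that the vehicle motion is expressed in the vehicle frame. All function names here are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Invert a 4x4 rigid transform [R | p] as [R^T | -R^T p]."""
    rot_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rot_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rot_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def yaw_transform(yaw, tx, ty, tz):
    """Build a 4x4 transform: rotation about z by `yaw` plus a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def camera_motion(vehicle_motion, cam_to_vehicle):
    """Express a vehicle-frame motion T in a camera frame as E^-1 T E.

    Assumption: `cam_to_vehicle` (E) maps camera coordinates to vehicle
    coordinates; a different extrinsic convention would swap E and E^-1.
    """
    return matmul(matmul(rigid_inverse(cam_to_vehicle), vehicle_motion),
                  cam_to_vehicle)
```

Under these conventions, a camera yawed 90 degrees from the vehicle's forward axis sees a 1 m forward vehicle translation as a 1 m translation along its own negative y-axis, while a camera whose extrinsic is the identity sees the vehicle motion unchanged. One shared vehicle motion therefore yields a consistent motion for every camera.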

Cite this Paper

BibTeX


@InProceedings{pmlr-v205-wei23a,
  title = 	 {SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation},
  author =       {Wei, Yi and Zhao, Linqing and Zheng, Wenzhao and Zhu, Zheng and Rao, Yongming and Huang, Guan and Lu, Jiwen and Zhou, Jie},
  booktitle = 	 {Proceedings of The 6th Conference on Robot Learning},
  pages = 	 {539--549},
  year = 	 {2023},
  editor = 	 {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume = 	 {205},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {14--18 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v205/wei23a/wei23a.pdf},
  url = 	 {https://proceedings.mlr.press/v205/wei23a.html},
  abstract = 	 {Depth estimation from images serves as the fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors like LiDAR. The temporal photometric consistency enables self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict the depth solely based on each monocular image and ignore the correlations among multiple surrounding cameras, which are typically available for modern self-driving vehicles. In this paper, we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views. We apply cross-view self-attention to efficiently enable the global interactions between multi-camera feature maps. Different from self-supervised monocular depth estimation, we are able to predict real-world scales given multi-camera extrinsic matrices. To achieve this goal, we adopt two-frame structure-from-motion to extract scale-aware pseudo depths to pretrain the models. Further, instead of predicting the ego-motion of each individual camera, we estimate a universal ego-motion of the vehicle and transfer it to each view to achieve multi-view consistency. In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets DDAD and nuScenes. }
}

Endnote

%0 Conference Paper
%T SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation
%A Yi Wei
%A Linqing Zhao
%A Wenzhao Zheng
%A Zheng Zhu
%A Yongming Rao
%A Guan Huang
%A Jiwen Lu
%A Jie Zhou
%B Proceedings of The 6th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Karen Liu
%E Dana Kulic
%E Jeff Ichnowski
%F pmlr-v205-wei23a
%I PMLR
%P 539--549
%U https://proceedings.mlr.press/v205/wei23a.html
%V 205
%X Depth estimation from images serves as the fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors like LiDAR. The temporal photometric consistency enables self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict the depth solely based on each monocular image and ignore the correlations among multiple surrounding cameras, which are typically available for modern self-driving vehicles. In this paper, we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views. We apply cross-view self-attention to efficiently enable the global interactions between multi-camera feature maps. Different from self-supervised monocular depth estimation, we are able to predict real-world scales given multi-camera extrinsic matrices. To achieve this goal, we adopt two-frame structure-from-motion to extract scale-aware pseudo depths to pretrain the models. Further, instead of predicting the ego-motion of each individual camera, we estimate a universal ego-motion of the vehicle and transfer it to each view to achieve multi-view consistency. In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets DDAD and nuScenes.

APA


Wei, Y., Zhao, L., Zheng, W., Zhu, Z., Rao, Y., Huang, G., Lu, J. & Zhou, J. (2023). SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:539-549. Available from https://proceedings.mlr.press/v205/wei23a.html.
