MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation

Muhammad Osama Khan; Junbang Liang; Chun-Kai Wang; Shan Yang; Yu Lou

Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, PMLR 243:116-132, 2024.

Abstract

Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a unified framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
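The layer-wise analysis mentioned in the abstract relies on CKA (centered kernel alignment), which scores how similar two sets of activations are on the same inputs. The sketch below shows the common linear variant in plain Python. It is an illustration only, not the authors' implementation: the function names and the toy activation matrices are assumptions, and the paper may use a different CKA variant or batching scheme.

```python
import math


def center_columns(m):
    """Subtract each column's mean from a row-major matrix (list of rows)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows.

    Rows are examples, columns are features. The score is undefined
    (division by zero) if either matrix has only constant columns.
    """
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return numerator / denominator


# Hypothetical activations: 3 examples, 2 features each.
layer_before = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
# Same features rotated by 90 degrees and scaled by 2.
layer_after = [[-4.0, 2.0], [-8.0, 6.0], [-14.0, 10.0]]

score = linear_cka(layer_before, layer_after)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so the rotated and rescaled toy layer scores as identical to the original. In an analysis like the one the abstract describes, a low score between a layer's activations before and after fine-tuning would indicate that the layer changed substantially.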

Cite this Paper

BibTeX


@InProceedings{pmlr-v243-khan24a,
  title = 	 {MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation},
  author =       {Khan, Muhammad Osama and Liang, Junbang and Wang, Chun-Kai and Yang, Shan and Lou, Yu},
  booktitle = 	 {Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models},
  pages = 	 {116--132},
  year = 	 {2024},
  editor = 	 {Fumero, Marco and Rodolà, Emanuele and Domine, Clementine and Locatello, Francesco and Dziugaite, Karolina and Caron, Mathilde},
  volume = 	 {243},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {15 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v243/khan24a/khan24a.pdf},
  url = 	 {https://proceedings.mlr.press/v243/khan24a.html},
  abstract = 	 {Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a unified framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.}
}

Endnote

%0 Conference Paper
%T MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation
%A Muhammad Osama Khan
%A Junbang Liang
%A Chun-Kai Wang
%A Shan Yang
%A Yu Lou
%B Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2024
%E Marco Fumero
%E Emanuele Rodolà
%E Clementine Domine
%E Francesco Locatello
%E Karolina Dziugaite
%E Mathilde Caron
%F pmlr-v243-khan24a
%I PMLR
%P 116--132
%U https://proceedings.mlr.press/v243/khan24a.html
%V 243
%X Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a unified framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.

APA


Khan, M.O., Liang, J., Wang, C., Yang, S. & Lou, Y. (2024). MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation. Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 243:116-132. Available from https://proceedings.mlr.press/v243/khan24a.html.
