D$^3$epth: Distilling Diffusion Models For Efficient Depth Estimation Through A Two-Stage Approach

Bo-Chih Chuang, Wei-Tung Lin, Shang-Fu Chen, Kailung Hua
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:81-96, 2025.

Abstract

Diffusion-based monocular depth estimation models demonstrate strong performance with limited supervision by leveraging pre-trained text-to-image models. However, their multi-step inference process and large model size create prohibitive computational overhead for practical applications. To retain the data efficiency of diffusion models while addressing their inference inefficiency, we propose a framework that enhances diffusion-based depth estimation through a two-stage training approach. The first stage distills implicit depth knowledge in the latent space by leveraging the rich representations of pre-trained diffusion models. The second stage refines explicit depth predictions in pixel space using a Hybrid Depth Loss that combines a Shift-Scale Invariant (SSI) loss for global structure preservation with an Edge-aware Gradient Huber loss for fine-grained detail enhancement. The two components are adaptively weighted by a dynamic task weighting strategy that balances structural consistency against boundary precision. We demonstrate that this two-stage distillation approach yields D$^3$epth, an efficient variant that achieves state-of-the-art results while significantly reducing computational requirements. In parallel, our base model D$^2$epth, trained with the enhanced pixel-space depth loss, also surpasses the state of the art across various benchmarks. Overall, these results deliver the accuracy benefits of diffusion-based methods at the efficiency level of traditional data-driven approaches.
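The Hybrid Depth Loss is the part of the method most amenable to a concrete illustration. The sketch below is a minimal PyTorch rendering, not the authors' implementation: it assumes the MiDaS-style closed-form scale-and-shift alignment for the SSI term, a Huber penalty on depth-gradient differences modulated by image edges for the Edge-aware Gradient Huber term, and homoscedastic uncertainty weighting (Kendall et al., 2018) as one plausible instantiation of the dynamic task weighting. All function and parameter names are hypothetical.

```python
# Minimal sketch of a Hybrid Depth Loss in the spirit of the abstract.
# Assumptions (not from the paper): MiDaS-style closed-form alignment for
# the SSI term; image-edge-weighted Huber on depth gradients for the
# edge-aware term; uncertainty-based task weighting for the dynamic balance.
import torch
import torch.nn.functional as F

def ssi_loss(pred, target, eps=1e-6):
    """Shift-Scale Invariant loss: least-squares align pred to target,
    then take the L1 error. pred, target: (B, H, W) depth maps."""
    b = pred.shape[0]
    p = pred.reshape(b, -1)
    t = target.reshape(b, -1)
    # Closed-form per-image scale s and shift c minimizing ||s*p + c - t||^2.
    p_mean, t_mean = p.mean(1, keepdim=True), t.mean(1, keepdim=True)
    p_c, t_c = p - p_mean, t - t_mean
    s = (p_c * t_c).sum(1, keepdim=True) / (p_c.pow(2).sum(1, keepdim=True) + eps)
    c = t_mean - s * p_mean
    return (s * p + c - t).abs().mean()

def edge_aware_gradient_huber(pred, target, image, delta=0.1):
    """Huber loss on depth-gradient differences, down-weighted where the
    image itself has strong edges so true depth discontinuities are not
    over-penalized (one plausible reading of 'edge-aware').
    image: (B, 3, H, W) in [0, 1]."""
    def grads(x):  # forward differences along width and height
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    gx_p, gy_p = grads(pred)
    gx_t, gy_t = grads(target)
    gx_i, gy_i = grads(image.mean(1))  # grayscale image gradients
    wx = torch.exp(-gx_i.abs())        # small weight at image edges
    wy = torch.exp(-gy_i.abs())
    hx = F.huber_loss(gx_p, gx_t, delta=delta, reduction="none")
    hy = F.huber_loss(gy_p, gy_t, delta=delta, reduction="none")
    return (wx * hx).mean() + (wy * hy).mean()

class HybridDepthLoss(torch.nn.Module):
    """Combines both terms with learned log-variance weights, a common
    dynamic task-weighting scheme and one guess at the paper's strategy."""
    def __init__(self):
        super().__init__()
        self.log_vars = torch.nn.Parameter(torch.zeros(2))

    def forward(self, pred, target, image):
        losses = torch.stack([
            ssi_loss(pred, target),
            edge_aware_gradient_huber(pred, target, image),
        ])
        # exp(-s_i) scales each task; the +s_i term keeps weights from
        # collapsing to zero during training.
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()
```

Under this reading, the learned log-variances let the optimizer shift emphasis between global structure (SSI) and boundary sharpness (gradient Huber) over the course of training, matching the abstract's description of adaptive balancing.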

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-chuang25a,
  title     = {D$^3$epth: Distilling Diffusion Models For Efficient Depth Estimation Through A Two-Stage Approach},
  author    = {Chuang, Bo-Chih and Lin, Wei-Tung and Chen, Shang-Fu and Hua, Kailung},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {81--96},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/chuang25a/chuang25a.pdf},
  url       = {https://proceedings.mlr.press/v304/chuang25a.html},
  abstract  = {Diffusion-based monocular depth estimation models demonstrate strong performance with limited supervision by leveraging pre-trained text-to-image models. However, their multi-step inference process and large model size create prohibitive computational overhead for practical applications. To retain the data efficiency of diffusion models while addressing their inference inefficiency, we propose a framework that enhances diffusion-based depth estimation through a two-stage training approach. The first stage distills implicit depth knowledge in the latent space by leveraging the rich representations from pre-trained diffusion models. The second stage refines explicit depth predictions in pixel space using Hybrid Depth Loss that combines Shift-Scale Invariant (SSI) loss for global structure preservation with Edge-aware Gradient Huber loss for fine-grained detail enhancement. Both components are adaptively weighted using a dynamic task weighting strategy, balancing structural consistency and boundary precision. Specifically, we demonstrate that our two-stage distillation approach yields D$^3$epth, an efficient variant that achieves state-of-the-art results while significantly reducing computational requirements. In parallel, our base model D$^2$epth, trained with enhanced pixel-space depth loss, also surpasses state-of-the-art performance across various benchmarks. Overall, these results deliver the accuracy benefits of diffusion-based methods at the efficiency level of traditional data-driven approaches.}
}
Endnote
%0 Conference Paper
%T D$^3$epth: Distilling Diffusion Models For Efficient Depth Estimation Through A Two-Stage Approach
%A Bo-Chih Chuang
%A Wei-Tung Lin
%A Shang-Fu Chen
%A Kailung Hua
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-chuang25a
%I PMLR
%P 81--96
%U https://proceedings.mlr.press/v304/chuang25a.html
%V 304
%X Diffusion-based monocular depth estimation models demonstrate strong performance with limited supervision by leveraging pre-trained text-to-image models. However, their multi-step inference process and large model size create prohibitive computational overhead for practical applications. To retain the data efficiency of diffusion models while addressing their inference inefficiency, we propose a framework that enhances diffusion-based depth estimation through a two-stage training approach. The first stage distills implicit depth knowledge in the latent space by leveraging the rich representations from pre-trained diffusion models. The second stage refines explicit depth predictions in pixel space using Hybrid Depth Loss that combines Shift-Scale Invariant (SSI) loss for global structure preservation with Edge-aware Gradient Huber loss for fine-grained detail enhancement. Both components are adaptively weighted using a dynamic task weighting strategy, balancing structural consistency and boundary precision. Specifically, we demonstrate that our two-stage distillation approach yields D$^3$epth, an efficient variant that achieves state-of-the-art results while significantly reducing computational requirements. In parallel, our base model D$^2$epth, trained with enhanced pixel-space depth loss, also surpasses state-of-the-art performance across various benchmarks. Overall, these results deliver the accuracy benefits of diffusion-based methods at the efficiency level of traditional data-driven approaches.
APA
Chuang, B., Lin, W., Chen, S. & Hua, K. (2025). D$^3$epth: Distilling Diffusion Models For Efficient Depth Estimation Through A Two-Stage Approach. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:81-96. Available from https://proceedings.mlr.press/v304/chuang25a.html.