EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams

Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster, Ipek Oguz
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:1697-1721, 2026.

Abstract

This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets, with quantitative results reported on phantom and simulated data that provide ground truth depth. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery.

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-li26f,
  title = {EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams},
  author = {Li, Hao and Lu, Daiwei and Wang, Jiacheng and Webster, Robert J. and Oguz, Ipek},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages = {1697--1721},
  year = {2026},
  editor = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume = {315},
  series = {Proceedings of Machine Learning Research},
  month = {08--10 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/li26f/li26f.pdf},
  url = {https://proceedings.mlr.press/v315/li26f.html},
  abstract = {This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets, with quantitative results reported on phantom and simulated data that provide ground truth depth. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery.}
}
Endnote
%0 Conference Paper
%T EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams
%A Hao Li
%A Daiwei Lu
%A Jiacheng Wang
%A Robert J. Webster
%A Ipek Oguz
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-li26f
%I PMLR
%P 1697--1721
%U https://proceedings.mlr.press/v315/li26f.html
%V 315
%X This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets, with quantitative results reported on phantom and simulated data that provide ground truth depth. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery.
APA
Li, H., Lu, D., Wang, J., Webster, R.J. & Oguz, I. (2026). EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:1697-1721. Available from https://proceedings.mlr.press/v315/li26f.html.
