Towards Effective Surgical Representation Learning with DINO Models

Ronald L.P.D. de Jong, Yiping Li, Tim J.M. Jaspers, Romy C. van Jaarsveld, Gino M. Kuiper, Franco Badaloni, Richard van Hillegersberg, Jelle P. Ruurda, Fons van der Sommen, Josien P.W. Pluim, Marcel Breeuwer
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:1065-1080, 2026.

Abstract

Self-supervised learning (SSL) has emerged as a promising approach to address the limitations of annotated surgical datasets, which are often small, heterogeneous, and expensive to curate. Among SSL methods, self-distillation with no labels (DINO) has achieved state-of-the-art (SOTA) results in natural images, but its applicability to surgical data remains underexplored. In this work, we systematically investigate DINOv1, DINOv2, and DINOv3 for surgical representation learning. We pretrain these models on a large-scale surgical dataset of 4.7M video frames (SurgeNetXL) and evaluate their transferability on downstream tasks including semantic segmentation and surgical phase recognition. Our results demonstrate that in-domain pretraining consistently improves performance across all DINO variants, with DINOv2 and DINOv3 achieving SOTA performance. We further offer practical insights and visualizations highlighting the effectiveness of SSL. Finally, our study delivers ready-to-use DINO-based SSL models and pretraining protocols for surgical computer vision research, which are publicly available at: github.com/rlpddejong/SurgeNetDINO.
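The released checkpoints are intended as drop-in backbones for surgical computer vision. Below is a minimal sketch of feature extraction with such a model, assuming PyTorch, the public facebookresearch/dinov2 torch.hub entrypoint, and a ViT-B/14 DINOv2 variant; the checkpoint filename is hypothetical and stands in for a weight file from the SurgeNetDINO repository, not the authors' actual API.

import torch

# Load the standard DINOv2 ViT-B/14 backbone from torch.hub.
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')

# Hypothetical checkpoint name: swap in surgical (SurgeNetXL-pretrained) weights.
state = torch.load('surgenet_dinov2_vitb14.pth', map_location='cpu')
model.load_state_dict(state, strict=False)
model.eval()

# Extract a global embedding for one preprocessed surgical video frame.
frame = torch.randn(1, 3, 224, 224)  # placeholder; real frames need resizing/normalization
with torch.no_grad():
    embedding = model(frame)         # shape (1, 768): the CLS-token feature

Such frame-level embeddings can then feed a segmentation decoder or a temporal model for surgical phase recognition, the two downstream tasks evaluated in the paper.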

Cite this Paper

BibTeX
@InProceedings{pmlr-v315-jong26a,
  title     = {Towards Effective Surgical Representation Learning with DINO Models},
  author    = {de Jong, Ronald L.P.D. and Li, Yiping and Jaspers, Tim J.M. and van Jaarsveld, Romy C. and Kuiper, Gino M. and Badaloni, Franco and van Hillegersberg, Richard and Ruurda, Jelle P. and van der Sommen, Fons and Pluim, Josien P.W. and Breeuwer, Marcel},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {1065--1080},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/jong26a/jong26a.pdf},
  url       = {https://proceedings.mlr.press/v315/jong26a.html},
  abstract  = {Self-supervised learning (SSL) has emerged as a promising approach to address the limitations of annotated surgical datasets, which are often small, heterogeneous, and expensive to curate. Among SSL methods, self-distillation with no labels (DINO) has achieved state-of-the-art (SOTA) results in natural images, but its applicability to surgical data remains underexplored. In this work, we systematically investigate DINOv1, DINOv2, and DINOv3 for surgical representation learning. We pretrain these models on a large-scale surgical dataset of 4.7M video frames (SurgeNetXL) and evaluate their transferability on downstream tasks including semantic segmentation and surgical phase recognition. Our results demonstrate that in-domain pretraining consistently improves performance across all DINO variants, with DINOv2 and DINOv3 achieving SOTA performance. We further offer practical insights and visualizations highlighting the effectiveness of SSL. Finally, our study delivers ready-to-use DINO-based SSL models and pretraining protocols for surgical computer vision research, which are publicly available at: github.com/rlpddejong/SurgeNetDINO.}
}
Endnote
%0 Conference Paper
%T Towards Effective Surgical Representation Learning with DINO Models
%A Ronald L.P.D. de Jong
%A Yiping Li
%A Tim J.M. Jaspers
%A Romy C. van Jaarsveld
%A Gino M. Kuiper
%A Franco Badaloni
%A Richard van Hillegersberg
%A Jelle P. Ruurda
%A Fons van der Sommen
%A Josien P.W. Pluim
%A Marcel Breeuwer
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-jong26a
%I PMLR
%P 1065--1080
%U https://proceedings.mlr.press/v315/jong26a.html
%V 315
%X Self-supervised learning (SSL) has emerged as a promising approach to address the limitations of annotated surgical datasets, which are often small, heterogeneous, and expensive to curate. Among SSL methods, self-distillation with no labels (DINO) has achieved state-of-the-art (SOTA) results in natural images, but its applicability to surgical data remains underexplored. In this work, we systematically investigate DINOv1, DINOv2, and DINOv3 for surgical representation learning. We pretrain these models on a large-scale surgical dataset of 4.7M video frames (SurgeNetXL) and evaluate their transferability on downstream tasks including semantic segmentation and surgical phase recognition. Our results demonstrate that in-domain pretraining consistently improves performance across all DINO variants, with DINOv2 and DINOv3 achieving SOTA performance. We further offer practical insights and visualizations highlighting the effectiveness of SSL. Finally, our study delivers ready-to-use DINO-based SSL models and pretraining protocols for surgical computer vision research, which are publicly available at: github.com/rlpddejong/SurgeNetDINO.
APA
de Jong, R.L., Li, Y., Jaspers, T.J., van Jaarsveld, R.C., Kuiper, G.M., Badaloni, F., van Hillegersberg, R., Ruurda, J.P., van der Sommen, F., Pluim, J.P. & Breeuwer, M. (2026). Towards Effective Surgical Representation Learning with DINO Models. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:1065-1080. Available from https://proceedings.mlr.press/v315/jong26a.html.
