Beyond Natural Images: A Dual-Stream DINOv3 Framework for PET/CT Segmentation
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2780-2794, 2026.
Abstract
Self-supervised vision transformers such as DINOv3 are strong universal feature extractors, yet their transferability to functional medical imaging remains limited when pretraining is confined to the misaligned natural-image domain. In this work, we introduce Dual-DINOv3, a dual-stream framework for PET/CT that addresses two key gaps in existing work: the absence of a public, PET-specific pretrained encoder and the reliance on fully paired PET/CT data for multimodal pretraining. First, we present the first PET-specific DINOv3 encoder, pretrained exclusively on large-scale public FDG-PET datasets using the full three-stage DINOv3 self-distillation pipeline. Second, we propose a modality-separated PET/CT framework in which PET- and CT-specific encoders are pretrained independently and fused during finetuning via multiscale cross-attention, enabling multimodal representation learning without requiring paired data during pretraining. Evaluation on the HECKTOR tumor segmentation benchmark demonstrates three central findings: (1) misaligned natural-image pretraining degrades PET/CT performance relative to training from scratch, (2) domain-aligned CT pretraining substantially improves segmentation across all tumor sizes, and (3) dual-stream PET/CT pretraining achieves the best performance overall, highlighting the complementary contributions of functional and anatomical cues. Together, these results provide a fully public PET encoder and a scalable PET/CT foundation model that support domain-aligned representation learning under realistic clinical data constraints.
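The abstract does not include an implementation, but the bidirectional fusion idea can be illustrated with a minimal, single-scale numpy sketch. All names and shapes below are hypothetical: each modality's encoder output is represented as a token matrix, and each stream attends to the other before a residual merge. The real multiscale, multi-head variant described in the paper would repeat this at several encoder depths.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product attention: one modality's tokens
    # (queries) attend to the other modality's tokens (context).
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d))
    return attn @ context

def fuse_streams(pet_tokens, ct_tokens):
    # Bidirectional cross-attention with residual connections:
    # PET tokens gather anatomical context from CT and vice versa.
    pet_fused = pet_tokens + cross_attention(pet_tokens, ct_tokens)
    ct_fused = ct_tokens + cross_attention(ct_tokens, pet_tokens)
    return pet_fused, ct_fused

rng = np.random.default_rng(0)
pet = rng.standard_normal((16, 32))  # hypothetical PET patch tokens (N, d)
ct = rng.standard_normal((16, 32))   # hypothetical CT patch tokens (N, d)
pet_fused, ct_fused = fuse_streams(pet, ct)
print(pet_fused.shape, ct_fused.shape)  # (16, 32) (16, 32)
```

Note that because the two encoders are pretrained independently, this fusion step during finetuning is the only place where the PET and CT representations interact, which is what removes the need for paired data at pretraining time.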