D$^2$-Former: Mixture-Of-Experts Guided Dual Transformer for Multi-Scale Medical Image Segmentation

Md Sohag Mia, Aya Taourirte, Muhammad Abdullah Adnan, Wenlong Ming
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:232-255, 2026.

Abstract

Precise delineation of anatomical structures from medical images is critical for clinical diagnosis and treatment planning, yet remains profoundly challenging due to ambiguous boundaries, extreme scale variations, and the heterogeneous appearances of pathological tissues. Current segmentation methods frequently fall short in effectively balancing global contextual understanding with adaptive, multi-scale feature fusion, limiting their robustness across diverse clinical scenarios. To address these limitations, we propose D2-Former, a novel encoder-decoder framework that integrates a dual-encoder architecture (combining a Swin Transformer for hierarchical local-global modeling and a DINOv3 foundation model for high-fidelity dense feature extraction) with a Softer Mixture-of-Experts (Softer-MoE) module for input-adaptive feature refinement. Our design further introduces a Spatial-Frequency Gated Channel Attention (SF-GCA) module to fuse complementary encoder representations and a Residual Attention Decoder (RAD) with deep supervision for progressive map reconstruction. Extensive experiments across nine public benchmarks (spanning polyp segmentation, retinal vessel delineation, multi-organ abdominal CT segmentation, and nuclei instance segmentation) demonstrate that D2-Former achieves state-of-the-art or highly competitive performance. The model exhibits strong generalization across varied anatomical scales, imaging modalities, and clinical scenarios, underscoring its potential for reliable computer-assisted diagnosis.
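As a rough illustration of two of the general ideas named in the abstract (soft mixture-of-experts feature refinement and gated channel fusion of two encoder streams), the PyTorch sketch below shows generic versions of these mechanisms. The module names, expert count, gating design, and tensor shapes are assumptions made for this example only; they do not reproduce the authors' Softer-MoE or SF-GCA implementations.

```python
# Illustrative sketch only: a generic soft mixture-of-experts feature refiner and a
# gated channel fusion of two encoder feature maps. All names and shapes here are
# assumptions for demonstration, not the D2-Former implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMoERefiner(nn.Module):
    """Refines a feature map as a softly weighted sum over expert branches."""

    def __init__(self, channels: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.GELU(),
            )
            for _ in range(num_experts)
        )
        # Gating network: global context vector -> soft weights over experts.
        self.gate = nn.Linear(channels, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        context = x.mean(dim=(2, 3))                                   # (B, C)
        weights = F.softmax(self.gate(context), dim=-1)                # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, C, H, W)
        refined = (weights[:, :, None, None, None] * expert_out).sum(dim=1)
        return x + refined                                             # residual refinement


class GatedChannelFusion(nn.Module):
    """Fuses two encoder streams with a learned per-channel gate."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, C, H, W) feature maps from the two encoders.
        pooled = torch.cat([a.mean(dim=(2, 3)), b.mean(dim=(2, 3))], dim=-1)
        g = self.gate(pooled)[:, :, None, None]                        # (B, C, 1, 1)
        return g * a + (1.0 - g) * b


if __name__ == "__main__":
    feats_a = torch.randn(2, 64, 32, 32)   # e.g. hierarchical-encoder features
    feats_b = torch.randn(2, 64, 32, 32)   # e.g. foundation-model features
    fused = GatedChannelFusion(64)(feats_a, feats_b)
    refined = SoftMoERefiner(64)(fused)
    print(refined.shape)                   # torch.Size([2, 64, 32, 32])
```

In this sketch the gate weighs each channel of one encoder stream against the other, and the expert mixture is applied as a residual refinement of the fused features; per the abstract, the paper's actual modules additionally exploit spatial-frequency cues in the fusion stage and deep supervision in the decoder.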

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-mia26a,
  title     = {D$^2$-Former: Mixture-Of-Experts Guided Dual Transformer for Multi-Scale Medical Image Segmentation},
  author    = {Mia, Md Sohag and Taourirte, Aya and Adnan, Muhammad Abdullah and Ming, Wenlong},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {232--255},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/mia26a/mia26a.pdf},
  url       = {https://proceedings.mlr.press/v315/mia26a.html},
  abstract  = {Precise delineation of anatomical structures from medical images is critical for clinical diagnosis and treatment planning, yet remains profoundly challenging due to ambiguous boundaries, extreme scale variations, and the heterogeneous appearances of pathological tissues. Current segmentation methods frequently fall short in effectively balancing global contextual understanding with adaptive, multi-scale feature fusion, limiting their robustness across diverse clinical scenarios. To address these limitations, we propose D2-Former, a novel encoder-decoder framework that integrates a dual-encoder architecture (combining a Swin Transformer for hierarchical local-global modeling and a DINOv3 foundation model for high-fidelity dense feature extraction) with a Softer Mixture-of-Experts (Softer-MoE) module for input-adaptive feature refinement. Our design further introduces a Spatial-Frequency Gated Channel Attention (SF-GCA) module to fuse complementary encoder representations and a Residual Attention Decoder (RAD) with deep supervision for progressive map reconstruction. Extensive experiments across nine public benchmarks (spanning polyp segmentation, retinal vessel delineation, multi-organ abdominal CT segmentation, and nuclei instance segmentation) demonstrate that D2-Former achieves state-of-the-art or highly competitive performance. The model exhibits strong generalization across varied anatomical scales, imaging modalities, and clinical scenarios, underscoring its potential for reliable computer-assisted diagnosis.}
}
Endnote
%0 Conference Paper
%T D$^2$-Former: Mixture-Of-Experts Guided Dual Transformer for Multi-Scale Medical Image Segmentation
%A Md Sohag Mia
%A Aya Taourirte
%A Muhammad Abdullah Adnan
%A Wenlong Ming
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-mia26a
%I PMLR
%P 232--255
%U https://proceedings.mlr.press/v315/mia26a.html
%V 315
%X Precise delineation of anatomical structures from medical images is critical for clinical diagnosis and treatment planning, yet remains profoundly challenging due to ambiguous boundaries, extreme scale variations, and the heterogeneous appearances of pathological tissues. Current segmentation methods frequently fall short in effectively balancing global contextual understanding with adaptive, multi-scale feature fusion, limiting their robustness across diverse clinical scenarios. To address these limitations, we propose D2-Former, a novel encoder-decoder framework that integrates a dual-encoder architecture (combining a Swin Transformer for hierarchical local-global modeling and a DINOv3 foundation model for high-fidelity dense feature extraction) with a Softer Mixture-of-Experts (Softer-MoE) module for input-adaptive feature refinement. Our design further introduces a Spatial-Frequency Gated Channel Attention (SF-GCA) module to fuse complementary encoder representations and a Residual Attention Decoder (RAD) with deep supervision for progressive map reconstruction. Extensive experiments across nine public benchmarks (spanning polyp segmentation, retinal vessel delineation, multi-organ abdominal CT segmentation, and nuclei instance segmentation) demonstrate that D2-Former achieves state-of-the-art or highly competitive performance. The model exhibits strong generalization across varied anatomical scales, imaging modalities, and clinical scenarios, underscoring its potential for reliable computer-assisted diagnosis.
APA
Mia, M.S., Taourirte, A., Adnan, M.A. & Ming, W. (2026). D$^2$-Former: Mixture-Of-Experts Guided Dual Transformer for Multi-Scale Medical Image Segmentation. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:232-255. Available from https://proceedings.mlr.press/v315/mia26a.html.
