ECT-3DMedSAM: Efficient Cross Teaching Using Segment Anything Model for Semi-Supervised 3D Medical Image Segmentation

Zhewen Huang, Sara R. Guariglia, Jiaqi Yang, Chia-Ling Tsai
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2057-2070, 2026.

Abstract

As precise manual annotation for medical imaging is both expert-intensive and costly, Semi-Supervised Medical Image Segmentation (SSMIS) provides a critical solution by leveraging large volumes of unlabeled data to achieve the high-performance segmentation necessary for anatomical structure analysis and disease diagnosis. Standard SSMIS approaches typically train specialized models from limited initialization and often fail to capture the complex semantic nuances of 3D anatomy. Foundation models offer superior generalization by leveraging large-scale pre-training, but they still struggle to adapt effectively when downstream annotations are limited. In this paper, we propose a novel cross-teaching framework tailored for the efficient adaptation of a 3D foundation model (MedSAM-2). We introduce a parameter-efficient design that shares frozen image and prompt encoders between two parallel mask decoders made learnable through Low-Rank Adaptation (LoRA). Furthermore, we replace the memory-intensive attention mechanism with a lightweight temporal propagation module, reducing memory consumption while maintaining critical local volumetric coherence. Our model processes weakly and strongly augmented views of the same input volume, creating a synergistic learning loop in which the two decoders mutually supervise each other. We validate our method across three distinct datasets and modalities. Experimental results demonstrate that our framework effectively bridges the domain gap, achieving a 57.9% reduction in the average 95% Hausdorff Distance and substantially enhancing boundary precision for fine anatomical structures. Moreover, our approach outperforms state-of-the-art baselines with a Dice score improvement of up to 2.8%, confirming its robustness and clinical reliability for volumetric segmentation.
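To make the two core mechanisms named in the abstract concrete, below is a minimal PyTorch sketch; it is not the authors' released implementation. It shows (a) a LoRA-wrapped linear layer, where a frozen pretrained weight is augmented by a trainable low-rank update, and (b) a confidence-masked cross-teaching loss in which two decoders mutually supervise each other on weakly and strongly augmented views. All module names, tensor shapes, and the confidence threshold are illustrative assumptions.

```python
# Minimal sketch (NOT the authors' released code) of LoRA adaptation and
# cross-teaching between two decoders; shapes and thresholds are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # keep the pretrained weight frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B (A x); only A and B receive gradients.
        return self.base(x) + self.scale * F.linear(F.linear(x, self.lora_a), self.lora_b)


def cross_teaching_loss(logits_a: torch.Tensor, logits_b: torch.Tensor,
                        conf_threshold: float = 0.75) -> torch.Tensor:
    """Each decoder's confident, detached prediction pseudo-labels the other."""
    prob_a, prob_b = torch.sigmoid(logits_a), torch.sigmoid(logits_b)
    pseudo_a = (prob_a.detach() > 0.5).float()       # hard pseudo-label from decoder A
    pseudo_b = (prob_b.detach() > 0.5).float()
    mask_a = (torch.maximum(prob_a, 1 - prob_a).detach() > conf_threshold).float()
    mask_b = (torch.maximum(prob_b, 1 - prob_b).detach() > conf_threshold).float()
    loss_ab = (F.binary_cross_entropy_with_logits(logits_b, pseudo_a, reduction="none") * mask_a).mean()
    loss_ba = (F.binary_cross_entropy_with_logits(logits_a, pseudo_b, reduction="none") * mask_b).mean()
    return loss_ab + loss_ba


if __name__ == "__main__":
    torch.manual_seed(0)
    base = nn.Linear(256, 1)                         # stand-in for one decoder projection
    decoder_a = LoRALinear(base, rank=4)             # two LoRA branches over the same frozen base
    decoder_b = LoRALinear(base, rank=4)
    tokens = torch.randn(2, 1024, 256)               # stand-in for shared frozen-encoder features
    weak, strong = tokens, tokens + 0.1 * torch.randn_like(tokens)  # toy weak/strong views
    loss = cross_teaching_loss(decoder_a(weak).squeeze(-1),
                               decoder_b(strong).squeeze(-1))
    print(f"cross-teaching loss: {loss.item():.4f}")
```

In the full framework described in the abstract, each decoder would sit atop the shared frozen MedSAM-2 image and prompt encoders, and this unsupervised cross-teaching term would be combined with a standard supervised loss on the labeled subset.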

Cite this Paper

BibTeX
@InProceedings{pmlr-v315-huang26a,
  title     = {ECT-3DMedSAM: Efficient Cross Teaching Using Segment Anything Model for Semi-Supervised 3D Medical Image Segmentation},
  author    = {Huang, Zhewen and Guariglia, Sara R. and Yang, Jiaqi and Tsai, Chia-Ling},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {2057--2070},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/huang26a/huang26a.pdf},
  url       = {https://proceedings.mlr.press/v315/huang26a.html}
}
Endnote
%0 Conference Paper
%T ECT-3DMedSAM: Efficient Cross Teaching Using Segment Anything Model for Semi-Supervised 3D Medical Image Segmentation
%A Zhewen Huang
%A Sara R. Guariglia
%A Jiaqi Yang
%A Chia-Ling Tsai
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-huang26a
%I PMLR
%P 2057--2070
%U https://proceedings.mlr.press/v315/huang26a.html
%V 315
APA
Huang, Z., Guariglia, S.R., Yang, J. & Tsai, C.-L. (2026). ECT-3DMedSAM: Efficient Cross Teaching Using Segment Anything Model for Semi-Supervised 3D Medical Image Segmentation. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:2057-2070. Available from https://proceedings.mlr.press/v315/huang26a.html.
