Video-CT MAE: Self-supervised Video-CT Domain Adaptation for Vertebral Fracture Diagnosis

Lukas Buess, Marijn F. Stollenga, David Schinz, Benedikt Wiestler, Jan Kirschke, Andreas Maier, Nassir Navab, Matthias Keicher
Proceedings of The 7th International Conference on Medical Imaging with Deep Learning, PMLR 250:151-167, 2024.

Abstract

Early and accurate diagnosis of vertebral body anomalies is crucial for effectively treating spinal disorders, but the manual interpretation of CT scans can be time-consuming and error-prone. While deep learning has shown promise in automating vertebral fracture detection, improving the interpretability of existing methods is crucial for building trust and ensuring reliable clinical application. Vision Transformers (ViTs) offer inherent interpretability through attention visualizations but are limited in their application to 3D medical images due to their reliance on 2D image pretraining. To address this challenge, we propose a novel approach that combines the benefits of transfer learning from video-pretrained models with domain adaptation via self-supervised pretraining on a task-specific but unlabeled dataset. Compared to naive transfer learning from Video MAE, our method improves downstream task performance by 8.3 points in F1 and speeds up training by a factor of 2. This closes the gap between videos and medical images, allowing a ViT to learn relevant anatomical features while adapting to the task domain. We demonstrate that our framework enables ViTs to effectively detect vertebral fractures in a low-data regime, outperforming CNN-based state-of-the-art methods while providing inherent interpretability. Our task adaptation approach and dataset not only improve the performance of our proposed method but also enhance existing self-supervised pretraining approaches, highlighting the benefits of task-specific self-supervised pretraining for domain adaptation. The code is publicly available.
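The core recipe the abstract describes — initialize a ViT from video-pretrained MAE weights, treat a stack of CT slices as the frames of a clip, and continue masked-reconstruction pretraining on unlabeled task-specific CT before fine-tuning — can be sketched in a few lines. The following is a minimal illustration built on the Hugging Face transformers VideoMAE implementation, not the authors' released code; the checkpoint name, the 16-slice crop, the channel replication, and the 90% masking ratio are all illustrative assumptions.

    import torch
    from transformers import VideoMAEForPreTraining

    # Step 1: transfer learning -- start from video-pretrained MAE weights
    # (checkpoint name assumed for illustration).
    model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

    # Step 2: treat a stack of CT slices as the "frames" of a clip; replicate
    # the single intensity channel to the 3 RGB channels the video model expects.
    ct_crop = torch.rand(1, 16, 1, 224, 224)      # (batch, slices, 1, H, W); hypothetical vertebra crop
    pixel_values = ct_crop.repeat(1, 1, 3, 1, 1)  # (batch, slices, 3, H, W)

    # Step 3: mask a large fraction of spatio-temporal patches and train the
    # model to reconstruct them (self-supervised adaptation on unlabeled CT).
    cfg = model.config
    seq_len = (16 // cfg.tubelet_size) * (224 // cfg.patch_size) ** 2
    mask_ratio = 0.9                              # assumed; high ratios are typical for MAE
    ranks = torch.rand(1, seq_len).argsort(-1).argsort(-1)
    bool_masked_pos = ranks >= int(seq_len * (1 - mask_ratio))

    outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)
    outputs.loss.backward()                       # reconstruction loss drives the adaptation

After this adaptation stage, the encoder would be transferred into a classification model (e.g., transformers' VideoMAEForVideoClassification) and fine-tuned on the labeled fracture data; the encoder's self-attention maps then supply the interpretability visualizations mentioned above.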

Cite this Paper


BibTeX
@InProceedings{pmlr-v250-buess24a,
  title     = {Video-CT MAE: Self-supervised Video-CT Domain Adaptation for Vertebral Fracture Diagnosis},
  author    = {Buess, Lukas and Stollenga, Marijn F. and Schinz, David and Wiestler, Benedikt and Kirschke, Jan and Maier, Andreas and Navab, Nassir and Keicher, Matthias},
  booktitle = {Proceedings of The 7th International Conference on Medical Imaging with Deep Learning},
  pages     = {151--167},
  year      = {2024},
  editor    = {Burgos, Ninon and Petitjean, Caroline and Vakalopoulou, Maria and Christodoulidis, Stergios and Coupe, Pierrick and Delingette, Hervé and Lartizien, Carole and Mateus, Diana},
  volume    = {250},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v250/main/assets/buess24a/buess24a.pdf},
  url       = {https://proceedings.mlr.press/v250/buess24a.html},
  abstract  = {Early and accurate diagnosis of vertebral body anomalies is crucial for effectively treating spinal disorders, but the manual interpretation of CT scans can be time-consuming and error-prone. While deep learning has shown promise in automating vertebral fracture detection, improving the interpretability of existing methods is crucial for building trust and ensuring reliable clinical application. Vision Transformers (ViTs) offer inherent interpretability through attention visualizations but are limited in their application to 3D medical images due to their reliance on 2D image pretraining. To address this challenge, we propose a novel approach that combines the benefits of transfer learning from video-pretrained models with domain adaptation via self-supervised pretraining on a task-specific but unlabeled dataset. Compared to naive transfer learning from Video MAE, our method improves downstream task performance by 8.3 points in F1 and speeds up training by a factor of 2. This closes the gap between videos and medical images, allowing a ViT to learn relevant anatomical features while adapting to the task domain. We demonstrate that our framework enables ViTs to effectively detect vertebral fractures in a low-data regime, outperforming CNN-based state-of-the-art methods while providing inherent interpretability. Our task adaptation approach and dataset not only improve the performance of our proposed method but also enhance existing self-supervised pretraining approaches, highlighting the benefits of task-specific self-supervised pretraining for domain adaptation. The code is publicly available.}
}
Endnote
%0 Conference Paper
%T Video-CT MAE: Self-supervised Video-CT Domain Adaptation for Vertebral Fracture Diagnosis
%A Lukas Buess
%A Marijn F. Stollenga
%A David Schinz
%A Benedikt Wiestler
%A Jan Kirschke
%A Andreas Maier
%A Nassir Navab
%A Matthias Keicher
%B Proceedings of The 7th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ninon Burgos
%E Caroline Petitjean
%E Maria Vakalopoulou
%E Stergios Christodoulidis
%E Pierrick Coupe
%E Hervé Delingette
%E Carole Lartizien
%E Diana Mateus
%F pmlr-v250-buess24a
%I PMLR
%P 151--167
%U https://proceedings.mlr.press/v250/buess24a.html
%V 250
%X Early and accurate diagnosis of vertebral body anomalies is crucial for effectively treating spinal disorders, but the manual interpretation of CT scans can be time-consuming and error-prone. While deep learning has shown promise in automating vertebral fracture detection, improving the interpretability of existing methods is crucial for building trust and ensuring reliable clinical application. Vision Transformers (ViTs) offer inherent interpretability through attention visualizations but are limited in their application to 3D medical images due to their reliance on 2D image pretraining. To address this challenge, we propose a novel approach that combines the benefits of transfer learning from video-pretrained models with domain adaptation via self-supervised pretraining on a task-specific but unlabeled dataset. Compared to naive transfer learning from Video MAE, our method improves downstream task performance by 8.3 points in F1 and speeds up training by a factor of 2. This closes the gap between videos and medical images, allowing a ViT to learn relevant anatomical features while adapting to the task domain. We demonstrate that our framework enables ViTs to effectively detect vertebral fractures in a low-data regime, outperforming CNN-based state-of-the-art methods while providing inherent interpretability. Our task adaptation approach and dataset not only improve the performance of our proposed method but also enhance existing self-supervised pretraining approaches, highlighting the benefits of task-specific self-supervised pretraining for domain adaptation. The code is publicly available.
APA
Buess, L., Stollenga, M.F., Schinz, D., Wiestler, B., Kirschke, J., Maier, A., Navab, N. & Keicher, M. (2024). Video-CT MAE: Self-supervised Video-CT Domain Adaptation for Vertebral Fracture Diagnosis. Proceedings of The 7th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 250:151-167. Available from https://proceedings.mlr.press/v250/buess24a.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v250/main/assets/buess24a/buess24a.pdf