On the Stability and Robustness of Vision Transformers for Neurodegenerative Disease Classification

Eloi Navet, Rémi Giraud, Boris Mansencal, Pierrick Coupé
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:4518-4554, 2026.

Abstract

Vision Transformers (ViTs) have recently been explored for structural MRI classification, motivated by their ability to capture non-local image structure. However, in limited and heterogeneous clinical cohorts, their weak inductive biases and sensitivity to training conditions often lead to high-variance behaviour. While binary settings such as cognitively normal vs. dementia are widely reported and typically exhibit moderate variability, we show that this stability does not extend to differential diagnosis. When increasing task complexity (e.g., controls vs. Alzheimer’s Disease vs. Frontotemporal Dementia), performance becomes sensitive to class imbalance and phenotype overlap, with greater variability driven by fewer samples per class, noisier labels, and increased inter-site heterogeneity. In this study, we investigate a stabilization protocol combining data augmentation, architectural constraints, and optimization strategies on multi-site MRI datasets. We assess how model variance evolves with task complexity using patient-level paired bootstrapping, calibration analysis, paired significance tests, and estimates of the probability of false outperformance to obtain uncertainty-aware comparisons across models. Our results highlight conditions under which Transformer-based classifiers can be consistently trained with limited neuroimaging data and illustrate that several performance gains disappear once stochastic variability is reported. These results emphasize that reliable differential diagnosis with ViTs requires both robust stabilization protocols to mitigate optimization noise and standardized uncertainty quantification beyond simple point-estimates.
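The uncertainty-aware comparison described above, patient-level paired bootstrapping of two classifiers, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, arguments, and the "false outperformance" proxy (the fraction of resamples in which the apparent winner does not win) are illustrative choices. Patients, not individual scans, are resampled with replacement, and both models are scored on the same resample so the comparison stays paired.

```python
import numpy as np

def paired_patient_bootstrap(y_true, pred_a, pred_b, patient_ids,
                             n_boot=1000, seed=0):
    """Patient-level paired bootstrap of the accuracy difference
    between two models (illustrative sketch; all inputs are 1-D
    NumPy arrays aligned scan-by-scan)."""
    rng = np.random.default_rng(seed)
    patients = np.unique(patient_ids)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample *patients* with replacement so correlated scans from
        # one subject stay together; keep the pairing by scoring both
        # models on the identical resample.
        sample = rng.choice(patients, size=len(patients), replace=True)
        idx = np.concatenate(
            [np.flatnonzero(patient_ids == p) for p in sample])
        acc_a = np.mean(pred_a[idx] == y_true[idx])
        acc_b = np.mean(pred_b[idx] == y_true[idx])
        diffs[b] = acc_a - acc_b
    ci = np.percentile(diffs, [2.5, 97.5])
    mean_diff = diffs.mean()
    # Rough proxy for "probability of false outperformance": how often
    # the resampled difference contradicts the apparent winner.
    p_false = np.mean(diffs <= 0) if mean_diff > 0 else np.mean(diffs >= 0)
    return mean_diff, ci, p_false
```

If the resulting confidence interval straddles zero, or `p_false` is large, a point-estimate "gain" of one model over the other is within stochastic variability, which is precisely the kind of disappearing improvement the abstract warns about.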

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-navet26a,
  title = {On the Stability and Robustness of Vision Transformers for Neurodegenerative Disease Classification},
  author = {Navet, Eloi and Giraud, R{\'e}mi and Mansencal, Boris and Coup{\'e}, Pierrick},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages = {4518--4554},
  year = {2026},
  editor = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume = {315},
  series = {Proceedings of Machine Learning Research},
  month = {08--10 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/navet26a/navet26a.pdf},
  url = {https://proceedings.mlr.press/v315/navet26a.html},
  abstract = {Vision Transformers (ViTs) have recently been explored for structural MRI classification, motivated by their ability to capture non-local image structure. However, in limited and heterogeneous clinical cohorts, their weak inductive biases and sensitivity to training conditions often lead to high-variance behaviour. While binary settings such as cognitively normal vs. dementia are widely reported and typically exhibit moderate variability, we show that this stability does not extend to differential diagnosis. When increasing task complexity (e.g., controls vs. Alzheimer’s Disease vs. Frontotemporal Dementia), performance becomes sensitive to class imbalance and phenotype overlap, with greater variability driven by fewer samples per class, noisier labels, and increased inter-site heterogeneity. In this study, we investigate a stabilization protocol combining data augmentation, architectural constraints, and optimization strategies on multi-site MRI datasets. We assess how model variance evolves with task complexity using patient-level paired bootstrapping, calibration analysis, paired significance tests, and estimates of the probability of false outperformance to obtain uncertainty-aware comparisons across models. Our results highlight conditions under which Transformer-based classifiers can be consistently trained with limited neuroimaging data and illustrate that several performance gains disappear once stochastic variability is reported. These results emphasize that reliable differential diagnosis with ViTs requires both robust stabilization protocols to mitigate optimization noise and standardized uncertainty quantification beyond simple point-estimates.}
}
Endnote
%0 Conference Paper
%T On the Stability and Robustness of Vision Transformers for Neurodegenerative Disease Classification
%A Eloi Navet
%A Rémi Giraud
%A Boris Mansencal
%A Pierrick Coupé
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-navet26a
%I PMLR
%P 4518--4554
%U https://proceedings.mlr.press/v315/navet26a.html
%V 315
%X Vision Transformers (ViTs) have recently been explored for structural MRI classification, motivated by their ability to capture non-local image structure. However, in limited and heterogeneous clinical cohorts, their weak inductive biases and sensitivity to training conditions often lead to high-variance behaviour. While binary settings such as cognitively normal vs. dementia are widely reported and typically exhibit moderate variability, we show that this stability does not extend to differential diagnosis. When increasing task complexity (e.g., controls vs. Alzheimer’s Disease vs. Frontotemporal Dementia), performance becomes sensitive to class imbalance and phenotype overlap, with greater variability driven by fewer samples per class, noisier labels, and increased inter-site heterogeneity. In this study, we investigate a stabilization protocol combining data augmentation, architectural constraints, and optimization strategies on multi-site MRI datasets. We assess how model variance evolves with task complexity using patient-level paired bootstrapping, calibration analysis, paired significance tests, and estimates of the probability of false outperformance to obtain uncertainty-aware comparisons across models. Our results highlight conditions under which Transformer-based classifiers can be consistently trained with limited neuroimaging data and illustrate that several performance gains disappear once stochastic variability is reported. These results emphasize that reliable differential diagnosis with ViTs requires both robust stabilization protocols to mitigate optimization noise and standardized uncertainty quantification beyond simple point-estimates.
APA
Navet, E., Giraud, R., Mansencal, B., & Coupé, P. (2026). On the Stability and Robustness of Vision Transformers for Neurodegenerative Disease Classification. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:4518-4554. Available from https://proceedings.mlr.press/v315/navet26a.html.
