Testing the Trust: Verification and Validation of Bayesian Segmentation under Uncertainty

Giuseppina Carannante, Nidhal C. Bouaynaya, Dimah Dera, Hassan M. Fathallah-Shaykh, Ghulam Rasool
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:4217-4239, 2026.

Abstract

Deep learning has achieved state-of-the-art performance in medical image segmentation, yet safe clinical deployment requires rigorous verification and validation of model robustness, reliability, and uncertainty behavior. Bayesian segmentation methods are often viewed as more trustworthy because they provide uncertainty estimates that can support human decision-making, flag unreliable predictions, and mitigate risks in downstream clinical workflows. However, most prior studies evaluate these models primarily on clean test data, with limited assessment of robustness to perturbations, and without examining whether the predicted uncertainty meaningfully correlates with segmentation quality. In this work, we conduct a comprehensive and systematic evaluation of state-of-the-art deterministic and Bayesian segmentation models across multiple datasets, corruption types, and performance metrics. Beyond accuracy-based metrics such as DSC and HD95, we analyze over- and under-segmentation trends, predictive variance, and the relationship between uncertainty and segmentation correctness. Our results show that while all models behave similarly on clean or mildly corrupted data, performance diverges significantly as perturbations increase. Models that learn and propagate uncertainty during training tend to exhibit improved robustness under severe perturbations and uncertainty estimates that better correlate with segmentation errors, suggesting potential advantages for safety-critical deployment.
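The abstract evaluates segmentation with DSC (Dice similarity coefficient) and HD95 (95th-percentile Hausdorff distance). As an illustrative aside not taken from the paper, here is a minimal NumPy sketch of both metrics on toy binary masks; the HD95 here is a simplification computed over all foreground pixels (proper implementations use boundary/surface points and spatial spacing).

```python
import numpy as np

def dice_score(pred, target):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, target).sum() / denom

def hd95(pred, target):
    """95th-percentile symmetric Hausdorff distance, brute-forced over
    all foreground pixels (fine for small toy masks only)."""
    a = np.argwhere(pred.astype(bool))
    b = np.argwhere(target.astype(bool))
    # pairwise Euclidean distances between foreground coordinates
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.percentile(np.concatenate([d.min(axis=1), d.min(axis=0)]), 95)

# toy 2-D masks: a 4x4 predicted square vs. ground truth shifted by one pixel
pred = np.zeros((8, 8), dtype=int)
target = np.zeros((8, 8), dtype=int)
pred[2:6, 2:6] = 1
target[3:7, 3:7] = 1

print(round(dice_score(pred, target), 4))  # overlap 9 / (16 + 16) -> 0.5625
print(hd95(pred, target))
```

Robustness studies like the one described typically track how both numbers degrade (DSC falling, HD95 growing) as corruption severity increases.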

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-carannante26a,
  title     = {Testing the Trust: Verification and Validation of Bayesian Segmentation under Uncertainty},
  author    = {Carannante, Giuseppina and Bouaynaya, Nidhal C. and Dera, Dimah and Fathallah-Shaykh, Hassan M. and Rasool, Ghulam},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {4217--4239},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/carannante26a/carannante26a.pdf},
  url       = {https://proceedings.mlr.press/v315/carannante26a.html},
  abstract  = {Deep learning has achieved state-of-the-art performance in medical image segmentation, yet safe clinical deployment requires rigorous verification and validation of model robustness, reliability, and uncertainty behavior. Bayesian segmentation methods are often viewed as more trustworthy because they provide uncertainty estimates that can support human decision-making, flag unreliable predictions, and mitigate risks in downstream clinical workflows. However, most prior studies evaluate these models primarily on clean test data, with limited assessment of robustness to perturbations, and without examining whether the predicted uncertainty meaningfully correlates with segmentation quality. In this work, we conduct a comprehensive and systematic evaluation of state-of-the-art deterministic and Bayesian segmentation models across multiple datasets, corruption types, and performance metrics. Beyond accuracy-based metrics such as DSC and HD95, we analyze over- and under-segmentation trends, predictive variance, and the relationship between uncertainty and segmentation correctness. Our results show that while all models behave similarly on clean or mildly corrupted data, performance diverges significantly as perturbations increase. Models that learn and propagate uncertainty during training tend to exhibit improved robustness under severe perturbations and uncertainty estimates that better correlate with segmentation errors, suggesting potential advantages for safety-critical deployment.}
}
Endnote
%0 Conference Paper
%T Testing the Trust: Verification and Validation of Bayesian Segmentation under Uncertainty
%A Giuseppina Carannante
%A Nidhal C. Bouaynaya
%A Dimah Dera
%A Hassan M. Fathallah-Shaykh
%A Ghulam Rasool
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-carannante26a
%I PMLR
%P 4217--4239
%U https://proceedings.mlr.press/v315/carannante26a.html
%V 315
%X Deep learning has achieved state-of-the-art performance in medical image segmentation, yet safe clinical deployment requires rigorous verification and validation of model robustness, reliability, and uncertainty behavior. Bayesian segmentation methods are often viewed as more trustworthy because they provide uncertainty estimates that can support human decision-making, flag unreliable predictions, and mitigate risks in downstream clinical workflows. However, most prior studies evaluate these models primarily on clean test data, with limited assessment of robustness to perturbations, and without examining whether the predicted uncertainty meaningfully correlates with segmentation quality. In this work, we conduct a comprehensive and systematic evaluation of state-of-the-art deterministic and Bayesian segmentation models across multiple datasets, corruption types, and performance metrics. Beyond accuracy-based metrics such as DSC and HD95, we analyze over- and under-segmentation trends, predictive variance, and the relationship between uncertainty and segmentation correctness. Our results show that while all models behave similarly on clean or mildly corrupted data, performance diverges significantly as perturbations increase. Models that learn and propagate uncertainty during training tend to exhibit improved robustness under severe perturbations and uncertainty estimates that better correlate with segmentation errors, suggesting potential advantages for safety-critical deployment.
APA
Carannante, G., Bouaynaya, N.C., Dera, D., Fathallah-Shaykh, H.M. & Rasool, G. (2026). Testing the Trust: Verification and Validation of Bayesian Segmentation under Uncertainty. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:4217-4239. Available from https://proceedings.mlr.press/v315/carannante26a.html.