Reproducibility of Training Deep Learning Models for Medical Image Analysis

Joeran Bosma, Dre Peeters, Natália Alves, Anindo Saha, Zaigham Saghir, Colin Jacobs, Henkjan Huisman
Medical Imaging with Deep Learning, PMLR 227:1269-1287, 2024.

Abstract

Performance of deep learning algorithms varies due to their development data and training method, but also due to several stochastic processes during training. Due to these random factors, a single training run may not accurately reflect the performance of a given training method. Statistical comparisons in the literature between different deep learning training methods typically ignore this performance variation between training runs and incorrectly claim significance of changes in training method. We hypothesize that the impact of such performance variation is substantial, such that it may invalidate biomedical competition leaderboards and some scientific papers. To test this, we investigate the reproducibility of training deep learning algorithms for medical image analysis. We repeated training runs from prior scientific studies: three diagnostic tasks (pancreatic cancer detection in CT, clinically significant prostate cancer detection in MRI, and lung nodule malignancy risk estimation in low-dose CT) and two organ segmentation tasks (pancreas segmentation in CT and prostate segmentation in MRI). A previously published top-performing algorithm for each task was trained multiple times to determine the variance in model performance. For all three diagnostic algorithms, performance variation from retraining was significant compared to data variance. Statistically comparing independently trained algorithms from the same training method using the same dataset should follow the null hypothesis, but we observed claimed significance with a p-value below 0.05 in 15% of comparisons with conventional testing (paired bootstrapping). We conclude that variance in model performance due to retraining is substantial and should be accounted for.
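
As an illustration of the paired bootstrapping test referenced in the abstract, the following Python sketch compares the AUC of two models on the same test cases. This is a generic rendering of the technique, not the authors' released code; the function name auc_diff_pvalue, the choice of AUC as the metric, and the 10,000-resample default are assumptions for the example.

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_diff_pvalue(y_true, scores_a, scores_b, n_boot=10_000, seed=0):
    """Two-sided p-value for the AUC difference between models A and B,
    estimated with paired bootstrapping: resample test cases with
    replacement and recompute both AUCs on the same resampled cases."""
    y_true = np.asarray(y_true)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)  # one paired resample of test cases
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined without both classes present
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx])
                     - roc_auc_score(y_true[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    # two-sided p-value: how often the bootstrap difference crosses zero
    p_one_sided = min((diffs <= 0).mean(), (diffs >= 0).mean())
    return min(1.0, 2.0 * p_one_sided)

Under the paper's null setup, where scores_a and scores_b come from two independent training runs of the same method evaluated on the same test set, such p-values should fall below 0.05 in roughly 5% of comparisons; the paper observed 15%, three times the nominal rate, because this test treats a single training run as representative of the training method.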

Cite this Paper


BibTeX
@InProceedings{pmlr-v227-bosma24a,
  title     = {Reproducibility of Training Deep Learning Models for Medical Image Analysis},
  author    = {Bosma, Joeran and Peeters, Dre and Alves, Nat\'alia and Saha, Anindo and Saghir, Zaigham and Jacobs, Colin and Huisman, Henkjan},
  booktitle = {Medical Imaging with Deep Learning},
  pages     = {1269--1287},
  year      = {2024},
  editor    = {Oguz, Ipek and Noble, Jack and Li, Xiaoxiao and Styner, Martin and Baumgartner, Christian and Rusu, Mirabela and Heinmann, Tobias and Kontos, Despina and Landman, Bennett and Dawant, Benoit},
  volume    = {227},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--12 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v227/bosma24a/bosma24a.pdf},
  url       = {https://proceedings.mlr.press/v227/bosma24a.html},
  abstract  = {Performance of deep learning algorithms varies due to their development data and training method, but also due to several stochastic processes during training. Due to these random factors, a single training run may not accurately reflect the performance of a given training method. Statistical comparisons in the literature between different deep learning training methods typically ignore this performance variation between training runs and incorrectly claim significance of changes in training method. We hypothesize that the impact of such performance variation is substantial, such that it may invalidate biomedical competition leaderboards and some scientific papers. To test this, we investigate the reproducibility of training deep learning algorithms for medical image analysis. We repeated training runs from prior scientific studies: three diagnostic tasks (pancreatic cancer detection in CT, clinically significant prostate cancer detection in MRI, and lung nodule malignancy risk estimation in low-dose CT) and two organ segmentation tasks (pancreas segmentation in CT and prostate segmentation in MRI). A previously published top-performing algorithm for each task was trained multiple times to determine the variance in model performance. For all three diagnostic algorithms, performance variation from retraining was significant compared to data variance. Statistically comparing independently trained algorithms from the same training method using the same dataset should follow the null hypothesis, but we observed claimed significance with a p-value below 0.05 in 15\% of comparisons with conventional testing (paired bootstrapping). We conclude that variance in model performance due to retraining is substantial and should be accounted for.}
}
Endnote
%0 Conference Paper
%T Reproducibility of Training Deep Learning Models for Medical Image Analysis
%A Joeran Bosma
%A Dre Peeters
%A Natália Alves
%A Anindo Saha
%A Zaigham Saghir
%A Colin Jacobs
%A Henkjan Huisman
%B Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ipek Oguz
%E Jack Noble
%E Xiaoxiao Li
%E Martin Styner
%E Christian Baumgartner
%E Mirabela Rusu
%E Tobias Heinmann
%E Despina Kontos
%E Bennett Landman
%E Benoit Dawant
%F pmlr-v227-bosma24a
%I PMLR
%P 1269--1287
%U https://proceedings.mlr.press/v227/bosma24a.html
%V 227
%X Performance of deep learning algorithms varies due to their development data and training method, but also due to several stochastic processes during training. Due to these random factors, a single training run may not accurately reflect the performance of a given training method. Statistical comparisons in the literature between different deep learning training methods typically ignore this performance variation between training runs and incorrectly claim significance of changes in training method. We hypothesize that the impact of such performance variation is substantial, such that it may invalidate biomedical competition leaderboards and some scientific papers. To test this, we investigate the reproducibility of training deep learning algorithms for medical image analysis. We repeated training runs from prior scientific studies: three diagnostic tasks (pancreatic cancer detection in CT, clinically significant prostate cancer detection in MRI, and lung nodule malignancy risk estimation in low-dose CT) and two organ segmentation tasks (pancreas segmentation in CT and prostate segmentation in MRI). A previously published top-performing algorithm for each task was trained multiple times to determine the variance in model performance. For all three diagnostic algorithms, performance variation from retraining was significant compared to data variance. Statistically comparing independently trained algorithms from the same training method using the same dataset should follow the null hypothesis, but we observed claimed significance with a p-value below 0.05 in 15% of comparisons with conventional testing (paired bootstrapping). We conclude that variance in model performance due to retraining is substantial and should be accounted for.
APA
Bosma, J., Peeters, D., Alves, N., Saha, A., Saghir, Z., Jacobs, C. & Huisman, H. (2024). Reproducibility of Training Deep Learning Models for Medical Image Analysis. Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 227:1269-1287. Available from https://proceedings.mlr.press/v227/bosma24a.html.