Why Out-of-Distribution detection experiments are not reliable - subtle experimental details muddle the OOD detector rankings

Kamil Szyc, Tomasz Walkowiak, Henryk Maciejewski
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:2078-2088, 2023.

Abstract

Reliable detection of out-of-distribution (OOD) instances is becoming a critical requirement for machine learning systems deployed in safety-critical applications. Recently, many OOD detectors have been developed in the literature, and their performance has been evaluated in empirical studies based on well-established benchmark datasets. However, these studies do not provide a conclusive recommendation because the performance of OOD detection depends on the benchmark datasets. In this work, we question the reliability of the OOD detection performance numbers obtained from many of these empirical experiments. We report several experimental conditions that are not controlled yet lead to significant changes in OOD detector performance and in the rankings of OOD methods. These include technicalities related to how the DNN was trained (such as the random seed or the train/test split), which do not change the accuracy of closed-set DNN models but may significantly change the performance of OOD detection methods that rely on representations from these DNNs. We performed extensive sensitivity studies in the image and text domains to quantify the instability of OOD performance measures due to such unintuitive experimental factors. These factors need to be more rigorously controlled and accounted for in many current OOD experiments. Experimental studies in OOD detection should improve methodological standards regarding experiment control and replication.
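To make the abstract's point concrete, below is a minimal, illustrative sketch (not the authors' code) of the kind of seed-sensitivity check described above: the same OOD scoring rule is applied to the outputs of several models that differ only in the training seed, and the spread of the resulting AUROC values is inspected. The maximum-softmax-probability score, the fake_logits helper, and the synthetic logits standing in for real DNN outputs are all assumptions made for the sake of a self-contained example.

# Minimal sketch (not the authors' code) of a seed-sensitivity check for an OOD detector:
# the same scoring rule is applied to per-seed model outputs and the AUROC spread is inspected.
# The per-seed logits below are synthetic placeholders; in a real study they would come from
# independently trained DNNs whose closed-set accuracy is essentially identical.
import numpy as np
from sklearn.metrics import roc_auc_score

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def msp_score(logits):
    # Maximum softmax probability baseline: higher score = "looks in-distribution".
    return softmax(logits).max(axis=1)

def fake_logits(rng, n, num_classes, margin):
    # Placeholder for one model's outputs: ID samples get a class margin, OOD samples do not.
    logits = rng.normal(size=(n, num_classes))
    labels = rng.integers(num_classes, size=n)
    logits[np.arange(n), labels] += margin
    return logits

num_classes, n = 10, 2000
aurocs = []
for seed in range(5):  # stand-in for five networks trained with different seeds
    rng = np.random.default_rng(seed)
    id_logits = fake_logits(rng, n, num_classes, margin=rng.uniform(2.0, 4.0))
    ood_logits = fake_logits(rng, n, num_classes, margin=0.0)
    scores = np.concatenate([msp_score(id_logits), msp_score(ood_logits)])
    is_id = np.concatenate([np.ones(n), np.zeros(n)])
    aurocs.append(roc_auc_score(is_id, scores))  # OOD-detection AUROC for this "seed"

print("AUROC per seed:", np.round(aurocs, 3))
print("max - min spread:", round(max(aurocs) - min(aurocs), 3))

In a real replication study, the synthetic logits would be replaced by the logits or penultimate-layer features of independently trained networks, and the closed-set accuracy of each model would be recorded alongside the OOD AUROC to confirm that it barely moves while the detector's performance does.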

Cite this Paper


BibTeX
@InProceedings{pmlr-v216-szyc23a,
  title     = {Why Out-of-Distribution detection experiments are not reliable - subtle experimental details muddle the {OOD} detector rankings},
  author    = {Szyc, Kamil and Walkowiak, Tomasz and Maciejewski, Henryk},
  booktitle = {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages     = {2078--2088},
  year      = {2023},
  editor    = {Evans, Robin J. and Shpitser, Ilya},
  volume    = {216},
  series    = {Proceedings of Machine Learning Research},
  month     = {31 Jul--04 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v216/szyc23a/szyc23a.pdf},
  url       = {https://proceedings.mlr.press/v216/szyc23a.html}
}
Endnote
%0 Conference Paper
%T Why Out-of-Distribution detection experiments are not reliable - subtle experimental details muddle the OOD detector rankings
%A Kamil Szyc
%A Tomasz Walkowiak
%A Henryk Maciejewski
%B Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2023
%E Robin J. Evans
%E Ilya Shpitser
%F pmlr-v216-szyc23a
%I PMLR
%P 2078--2088
%U https://proceedings.mlr.press/v216/szyc23a.html
%V 216
APA
Szyc, K., Walkowiak, T. & Maciejewski, H. (2023). Why Out-of-Distribution detection experiments are not reliable - subtle experimental details muddle the OOD detector rankings. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:2078-2088. Available from https://proceedings.mlr.press/v216/szyc23a.html.